Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model to determine the value.
Rusty Bargain is interested in:
- the quality of the prediction;
- the speed of the prediction;
- the time required for training.
Target = price
Environment Setup & Required Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import time
import gc
Data preparation
df = pd.read_csv("/datasets/car_data.csv")
# Inspect dataset
df1 = df.copy()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64
 8   RegistrationMonth  354369 non-null  int64
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64
 14  PostalCode         354369 non-null  int64
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB
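The `info()` output shows that missing values are confined to five categorical columns (VehicleType, Gearbox, Model, FuelType, NotRepaired). As a minimal sketch, the share of missing values per column could be ranked as below. The `toy` frame is a hypothetical stand-in for the real dataset, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Price": [500, 1200, 3000, 150],
    "VehicleType": ["small", np.nan, "wagon", np.nan],
    "NotRepaired": [np.nan, np.nan, "no", np.nan],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real frame, the same expression applied to `df` would give the fraction of missing entries for each of the 16 columns.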
Standardize Columns
df.columns = df.columns.str.lower()
Is there a reason we are randomly filling this with wrangler?
Details to Help with Data Cleaning
General
- 1769: Steam Wagon (Nicolas-Joseph Cugnot, France) Steam-powered, heavy, experimental — not practical
- 1800s: Steam carriages - Small numbers in UK & France, for private roads
- 1830s–1890s: Electric vehicles - Short-range city vehicles, mostly experimental or low-volume
- The first gasoline car was made as early as 1885
- The first car registration was issued on August 14th, 1893
History of Automatic Transmissions
- 1904: Sturtevant Automatic Automobile
- 1939/1940: Cadillac & Oldsmobile w/ Hydra-Matic by General Motors
- 1941: Buick (military - WWII civilian car production halt (1942)) - Chrysler Fluid Drive / Vacamatic / Prestomatic
- 1948: Buick Roadmaster / Dynaflow (1949)
- 1950: Powerglide by Chevrolet
- 1961: K4A Mercedes-Benz
- most Cadillac, Oldsmobile, Buick, and Chrysler
- 1962 - : Automatics rapidly expanded
First Car by Brand (Earliest Registration Year)
- Rover: 1885
- Mercedes-Benz: 1886
- Peugeot: 1889
- Opel: 1899
- Renault: 1899
- Fiat: 1899
- Ford: 1903
- Škoda: 1905
- Lancia: 1906
- Daihatsu: 1907
- Suzuki: 1909
- Audi: 1910
- Alfa Romeo: 1910
- Chevrolet: 1911
- Mitsubishi: 1917
- Citroën: 1919
- Jaguar: 1922
- Chrysler: 1924
- Volvo: 1927
- BMW: 1928
- Mazda: 1931
- Porsche: 1931
- Nissan: 1933
- Toyota: 1936
- Volkswagen: 1937
- Jeep: 1941
- Kia: 1944
- Saab: 1947
- Honda: 1948
- Land Rover: 1948
- SEAT: 1950
- Subaru: 1954
- Trabant: 1957
- Mini: 1959
- Dacia: 1966
- Daewoo: 1967
- Hyundai: 1967
- Lada: 1970
- Smart: 1998
- Sonstige_autos: N/A (miscellaneous)
# Look at years before 1885 and after 2025
df[(df["registrationyear"] > 2025) | (df["registrationyear"] < 1885)]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 622 | 16/03/2016 16:55 | 0 | NaN | 1111 | NaN | 0 | NaN | 5000 | 0 | NaN | opel | NaN | 16/03/2016 00:00 | 0 | 44628 | 20/03/2016 16:44 |
| 12946 | 29/03/2016 18:39 | 49 | NaN | 5000 | NaN | 0 | golf | 5000 | 12 | NaN | volkswagen | NaN | 29/03/2016 00:00 | 0 | 74523 | 06/04/2016 04:16 |
| 15147 | 14/03/2016 00:52 | 0 | NaN | 9999 | NaN | 0 | NaN | 10000 | 0 | NaN | sonstige_autos | NaN | 13/03/2016 00:00 | 0 | 32689 | 21/03/2016 23:46 |
| 15870 | 02/04/2016 11:55 | 1700 | NaN | 3200 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 02/04/2016 00:00 | 0 | 33649 | 06/04/2016 09:46 |
| 16062 | 29/03/2016 23:42 | 190 | NaN | 1000 | NaN | 0 | mondeo | 5000 | 0 | NaN | ford | NaN | 29/03/2016 00:00 | 0 | 47166 | 06/04/2016 10:44 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340548 | 02/04/2016 17:44 | 0 | NaN | 3500 | manual | 75 | NaN | 5000 | 3 | petrol | sonstige_autos | NaN | 02/04/2016 00:00 | 0 | 96465 | 04/04/2016 15:17 |
| 340759 | 04/04/2016 23:55 | 700 | NaN | 1600 | manual | 1600 | a3 | 150000 | 4 | petrol | audi | no | 04/04/2016 00:00 | 0 | 86343 | 05/04/2016 06:44 |
| 341791 | 28/03/2016 17:37 | 1 | NaN | 3000 | NaN | 0 | zafira | 5000 | 0 | NaN | opel | NaN | 28/03/2016 00:00 | 0 | 26624 | 02/04/2016 22:17 |
| 348830 | 22/03/2016 00:38 | 1 | NaN | 1000 | NaN | 1000 | NaN | 150000 | 0 | NaN | sonstige_autos | NaN | 21/03/2016 00:00 | 0 | 41472 | 05/04/2016 14:18 |
| 351682 | 12/03/2016 00:57 | 11500 | NaN | 1800 | NaN | 16 | other | 5000 | 6 | petrol | fiat | NaN | 11/03/2016 00:00 | 0 | 16515 | 05/04/2016 19:47 |
171 rows × 16 columns
Brands with no registration years earlier than their earliest record
- Lada
- Daewoo
- Dacia
- Mini
- SEAT
- Land Rover
- Honda
- Saab
- Kia
- Nissan
- Porsche
- Mazda
- Jaguar
- Chrysler
- Volvo
- Rover
- Mercedes-Benz
- Peugeot
- Opel
- Renault
- Fiat
- Ford
- Škoda
- Lancia
- Daihatsu
- Suzuki
- Audi
- Alfa Romeo
- Chevrolet
# Look at smart cars registered before 1998
df[(df['brand'] == 'smart') & (df['registrationyear'] < 1998)]
smartnan = (df['brand'] == 'smart') & (df['registrationyear'] < 1998) & (df['model'].isna())
df.loc[smartnan,['registrationyear']] = np.nan
# Hyundai before 1967 implausible
hyundai = (df['brand'] == 'hyundai') & (df['registrationyear'] < 1967) & (df['model'].isna())
df.loc[hyundai, ['registrationyear']] = np.nan
remaining = ['smart', 'hyundai', 'mitsubishi', 'citroen', 'bmw', 'toyota', 'volkswagen', 'jeep', 'subaru', 'trabant']
earliest_years = {'smart': 1998, 'hyundai': 1967, 'mitsubishi': 1917, 'citroen': 1919, 'bmw': 1928, 'toyota': 1936,
                  'volkswagen': 1937, 'jeep': 1941, 'subaru': 1954, 'trabant': 1957}
# For each brand, null out years earlier than the brand's first car when the model is unknown
for brand in remaining:
    is_brand = df['brand'] == brand
    too_early = is_brand & (df['registrationyear'] < earliest_years[brand])
    df.loc[too_early & df['model'].isna(), 'registrationyear'] = np.nan
    # Rows that are still too early after the update have a known model; show them for review
    display(df[is_brand & (df['registrationyear'] < earliest_years[brand])])
# 'Kaefer' is the German name for the Beetle; they are the same car
beetle = df['model'] == 'kaefer'
df.loc[beetle, ['model']] = 'beetle'
del smartnan,hyundai,beetle
gc.collect()
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31212 | 12/03/2016 16:45 | 700 | small | 1997.0 | NaN | 0 | forfour | 5000 | 3 | petrol | smart | NaN | 12/03/2016 00:00 | 0 | 88416 | 07/04/2016 06:17 |
| 161667 | 04/04/2016 20:56 | 1650 | small | 1992.0 | auto | 55 | fortwo | 100000 | 7 | petrol | smart | no | 04/04/2016 00:00 | 0 | 28327 | 06/04/2016 23:44 |
| 319739 | 05/04/2016 20:36 | 1650 | small | 1992.0 | NaN | 0 | fortwo | 100000 | 6 | NaN | smart | yes | 05/04/2016 00:00 | 0 | 28327 | 05/04/2016 20:36 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 244840 | 09/03/2016 17:50 | 0 | NaN | 1910.0 | NaN | 0 | other | 5000 | 0 | NaN | hyundai | NaN | 09/03/2016 00:00 | 0 | 59510 | 07/04/2016 10:44 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 154559 | 03/04/2016 12:40 | 0 | small | 1910.0 | manual | 0 | colt | 150000 | 0 | petrol | mitsubishi | NaN | 03/04/2016 00:00 | 0 | 46397 | 07/04/2016 14:57 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 125577 | 15/03/2016 18:38 | 7750 | NaN | 1001.0 | NaN | 0 | other | 5000 | 0 | NaN | citroen | NaN | 15/03/2016 00:00 | 0 | 66706 | 06/04/2016 18:47 |
| 270911 | 23/03/2016 11:48 | 0 | other | 1910.0 | manual | 0 | other | 5000 | 0 | petrol | citroen | no | 23/03/2016 00:00 | 0 | 98630 | 23/03/2016 11:48 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58883 | 15/03/2016 21:57 | 1 | NaN | 1910.0 | NaN | 0 | 3er | 150000 | 0 | NaN | bmw | NaN | 15/03/2016 00:00 | 0 | 74074 | 07/04/2016 07:17 |
| 119442 | 18/03/2016 10:37 | 1 | NaN | 1000.0 | NaN | 1000 | 3er | 5000 | 0 | NaN | bmw | NaN | 18/03/2016 00:00 | 0 | 94086 | 05/04/2016 22:16 |
| 203230 | 01/04/2016 15:37 | 400 | NaN | 1910.0 | manual | 170 | 3er | 5000 | 0 | NaN | bmw | NaN | 01/04/2016 00:00 | 0 | 66333 | 03/04/2016 11:48 |
| 213499 | 08/03/2016 12:06 | 380 | NaN | 1000.0 | NaN | 0 | 6er | 5000 | 0 | NaN | bmw | NaN | 08/03/2016 00:00 | 0 | 35102 | 06/04/2016 00:16 |
| 287304 | 09/03/2016 15:54 | 500 | NaN | 1602.0 | manual | 0 | other | 5000 | 0 | NaN | bmw | yes | 09/03/2016 00:00 | 0 | 30900 | 10/03/2016 12:17 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23750 | 16/03/2016 19:58 | 3900 | wagon | 1910.0 | manual | 90 | passat | 150000 | 0 | petrol | volkswagen | NaN | 16/03/2016 00:00 | 0 | 88662 | 07/04/2016 05:45 |
| 35943 | 19/03/2016 10:57 | 200 | other | 1910.0 | NaN | 0 | caddy | 150000 | 0 | gasoline | volkswagen | NaN | 19/03/2016 00:00 | 0 | 35096 | 20/03/2016 18:10 |
| 40133 | 23/03/2016 18:00 | 0 | NaN | 1910.0 | NaN | 0 | other | 5000 | 0 | NaN | volkswagen | NaN | 23/03/2016 00:00 | 0 | 85045 | 23/03/2016 18:41 |
| 53577 | 20/03/2016 11:44 | 330 | NaN | 1000.0 | NaN | 0 | polo | 5000 | 0 | NaN | volkswagen | NaN | 20/03/2016 00:00 | 0 | 45259 | 04/04/2016 08:17 |
| 56241 | 30/03/2016 18:54 | 950 | NaN | 1400.0 | manual | 1400 | golf | 125000 | 4 | petrol | volkswagen | NaN | 30/03/2016 00:00 | 0 | 50389 | 03/04/2016 09:45 |
| 62803 | 07/03/2016 22:58 | 3400 | small | 1910.0 | manual | 90 | beetle | 90000 | 4 | NaN | volkswagen | no | 07/03/2016 00:00 | 0 | 34308 | 12/03/2016 08:16 |
| 71459 | 27/03/2016 23:46 | 500 | NaN | 1000.0 | NaN | 0 | golf | 5000 | 0 | NaN | volkswagen | NaN | 27/03/2016 00:00 | 0 | 91628 | 29/03/2016 13:46 |
| 74814 | 21/03/2016 12:52 | 400 | NaN | 1910.0 | NaN | 60 | golf | 150000 | 0 | petrol | volkswagen | NaN | 21/03/2016 00:00 | 0 | 29462 | 25/03/2016 09:17 |
| 143621 | 17/03/2016 23:40 | 550 | NaN | 1000.0 | NaN | 1000 | golf | 5000 | 6 | petrol | volkswagen | NaN | 17/03/2016 00:00 | 0 | 91732 | 26/03/2016 05:18 |
| 144388 | 09/03/2016 20:52 | 50 | NaN | 1910.0 | NaN | 0 | kaefer | 5000 | 0 | NaN | volkswagen | NaN | 09/03/2016 00:00 | 0 | 50374 | 05/04/2016 18:46 |
| 147663 | 03/04/2016 19:37 | 0 | NaN | 1910.0 | NaN | 0 | polo | 5000 | 0 | NaN | volkswagen | NaN | 03/04/2016 00:00 | 0 | 2826 | 05/04/2016 20:15 |
| 151280 | 05/04/2016 00:39 | 300 | NaN | 1910.0 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 04/04/2016 00:00 | 0 | 89269 | 05/04/2016 05:42 |
| 164397 | 29/03/2016 17:49 | 0 | NaN | 1000.0 | NaN | 0 | transporter | 5000 | 1 | NaN | volkswagen | NaN | 29/03/2016 00:00 | 0 | 29351 | 06/04/2016 03:45 |
| 174893 | 05/03/2016 19:48 | 0 | NaN | 1000.0 | NaN | 1000 | golf | 5000 | 4 | petrol | volkswagen | NaN | 05/03/2016 00:00 | 0 | 35716 | 05/03/2016 22:27 |
| 183727 | 03/04/2016 12:48 | 0 | bus | 1910.0 | NaN | 0 | transporter | 5000 | 0 | NaN | volkswagen | NaN | 03/04/2016 00:00 | 0 | 84478 | 03/04/2016 12:48 |
| 189722 | 29/03/2016 16:56 | 1500 | NaN | 1000.0 | manual | 0 | kaefer | 5000 | 0 | petrol | volkswagen | NaN | 29/03/2016 00:00 | 0 | 48324 | 31/03/2016 10:15 |
| 203985 | 07/03/2016 14:53 | 222 | NaN | 1910.0 | manual | 0 | golf | 5000 | 0 | petrol | volkswagen | NaN | 07/03/2016 00:00 | 0 | 26802 | 12/03/2016 04:15 |
| 218241 | 16/03/2016 12:46 | 7999 | NaN | 1800.0 | NaN | 290 | golf | 5000 | 6 | NaN | volkswagen | NaN | 16/03/2016 00:00 | 0 | 15827 | 29/03/2016 20:47 |
| 256532 | 05/03/2016 17:44 | 12500 | NaN | 1000.0 | NaN | 200 | golf | 5000 | 0 | NaN | volkswagen | NaN | 28/02/2016 00:00 | 0 | 75378 | 07/04/2016 12:17 |
| 276318 | 31/03/2016 14:58 | 300 | NaN | 1910.0 | NaN | 0 | polo | 5000 | 0 | NaN | volkswagen | NaN | 31/03/2016 00:00 | 0 | 53902 | 06/04/2016 08:16 |
| 286928 | 18/03/2016 16:51 | 1 | NaN | 1000.0 | NaN | 174 | touareg | 5000 | 3 | gasoline | volkswagen | NaN | 18/03/2016 00:00 | 0 | 97616 | 05/04/2016 22:44 |
| 318111 | 25/03/2016 13:42 | 1 | NaN | 1910.0 | NaN | 0 | golf | 125000 | 0 | NaN | volkswagen | NaN | 25/03/2016 00:00 | 0 | 54295 | 06/04/2016 15:44 |
| 318501 | 02/04/2016 13:57 | 0 | NaN | 1910.0 | NaN | 0 | caddy | 5000 | 0 | NaN | volkswagen | NaN | 02/04/2016 00:00 | 0 | 16949 | 06/04/2016 12:16 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18224 | 09/03/2016 17:49 | 7999 | NaN | 1500.0 | manual | 224 | impreza | 5000 | 3 | NaN | subaru | NaN | 09/03/2016 00:00 | 0 | 53577 | 15/03/2016 05:15 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 199563 | 09/03/2016 20:37 | 60 | wagon | 1956.0 | NaN | 0 | other | 150000 | 0 | NaN | trabant | NaN | 09/03/2016 00:00 | 0 | 16775 | 05/04/2016 16:45 |
| 294028 | 28/03/2016 23:45 | 0 | NaN | 1111.0 | NaN | 0 | 601 | 5000 | 0 | NaN | trabant | NaN | 28/03/2016 00:00 | 0 | 6712 | 30/03/2016 16:45 |
4
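The per-brand loop above can also be written without iterating over brands. The sketch below is one possible vectorized form, assuming the same rule (null the year only when the brand's earliest year is violated and the model is unknown). The `toy` frame and the shortened `earliest_years` mapping are hypothetical illustrations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the real frame has many more columns
toy = pd.DataFrame({
    "brand": ["smart", "smart", "bmw", "opel"],
    "model": [np.nan, "fortwo", np.nan, np.nan],
    "registrationyear": [1990.0, 1992.0, 1910.0, 1950.0],
})

# Subset of the earliest_years mapping used above
earliest_years = {"smart": 1998, "bmw": 1928}

# Brands missing from the mapping map to NaN, so the comparison is False for them
brand_floor = toy["brand"].map(earliest_years)
too_early = toy["registrationyear"] < brand_floor

# Same rule as the loop: only null the year when the model is also unknown
toy.loc[too_early & toy["model"].isna(), "registrationyear"] = np.nan
print(toy)
```

Rows with a known model (the `fortwo` row here) keep their year and remain available for manual review, as in the displayed tables.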
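# Dates are stored day-first (dd/mm/yyyy), so tell pandas not to read them month-first
df['datecreated'] = pd.to_datetime(df['datecreated'], dayfirst=True)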
# Cars shouldn't be registered after datecreated
plt.scatter(df['registrationyear'], df['datecreated'].dt.year, alpha=0.3)
plt.xlabel("Registration Year")
plt.ylabel("Ad Creation Year")
plt.title("Registration Year vs Ad Creation Year")
plt.show()
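The rule that a car should not be registered after its ad was created can also be checked numerically rather than read off the scatter plot. The following is a minimal sketch on hypothetical rows. It assumes the `dd/mm/yyyy hh:mm` layout seen in the raw data and uses an explicit `format` so that strings such as `09/03/2016` are not read month-first.

```python
import pandas as pd

# Hypothetical rows in the dataset's dd/mm/yyyy format
toy = pd.DataFrame({
    "datecreated": ["09/03/2016 00:00", "28/02/2016 00:00", "04/04/2016 00:00"],
    "registrationyear": [2010, 2018, 2016],
})

# An explicit format avoids month/day ambiguity for strings like 09/03/2016
toy["datecreated"] = pd.to_datetime(toy["datecreated"], format="%d/%m/%Y %H:%M")

# A car cannot be registered in a year later than the one its ad was created in
registered_after_ad = toy["registrationyear"] > toy["datecreated"].dt.year
print(registered_after_ad.sum())
```

The resulting boolean mask could then be used to inspect or null the offending rows in the same way as the brand-based masks earlier.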
registration_years = {
'fortwo': {'earliest': 1998, 'latest': 2016},
'forfour': {'earliest': 2004, 'latest': 2016},
'colt': {'earliest': 1962, 'latest': 2013},
'3er': {'earliest': 1975, 'latest': 2016},
'6er': {'earliest': 1976, 'latest': 2016},
'caddy': {'earliest': 1980, 'latest': 2016},
'polo': {'earliest': 1975, 'latest': 2016},
'golf': {'earliest': 1974, 'latest': 2016},
'beetle': {'earliest': 1938, 'latest': 2016},
'transporter': {'earliest': 1950, 'latest': 2016},
'touareg': {'earliest': 2002, 'latest': 2016},
'impreza': {'earliest': 1992, 'latest': 2016},
'601': {'earliest': 1964, 'latest': 1991},
'corsa': {'earliest': 1982, 'latest': 2016},
'astra': {'earliest': 1991, 'latest': 2016},
'passat': {'earliest': 1973, 'latest': 2016},
'a4': {'earliest': 1994, 'latest': 2016},
'c_klasse': {'earliest': 1993, 'latest': 2016},
'5er': {'earliest': 1972, 'latest': 2016},
'e_klasse': {'earliest': 1993, 'latest': 2016},
'a3': {'earliest': 1996, 'latest': 2016},
'focus': {'earliest': 1998, 'latest': 2016},
'fiesta': {'earliest': 1976, 'latest': 2016},
'a6': {'earliest': 1994, 'latest': 2016},
'twingo': {'earliest': 1993, 'latest': 2016},
'2_reihe': {'earliest': 1982, 'latest': 2016},
'vectra': {'earliest': 1988, 'latest': 2008},
'a_klasse': {'earliest': 1997, 'latest': 2016},
'mondeo': {'earliest': 1993, 'latest': 2016},
'clio': {'earliest': 1991, 'latest': 2016},
'1er': {'earliest': 2004, 'latest': 2016},
'3_reihe': {'earliest': 1982, 'latest': 2016},
'touran': {'earliest': 2003, 'latest': 2016},
'punto': {'earliest': 1993, 'latest': 2016},
'zafira': {'earliest': 1999, 'latest': 2016},
'megane': {'earliest': 1995, 'latest': 2016},
'ibiza': {'earliest': 1984, 'latest': 2016},
'ka': {'earliest': 1996, 'latest': 2016},
'lupo': {'earliest': 1998, 'latest': 2005},
'octavia': {'earliest': 1996, 'latest': 2016},
'fabia': {'earliest': 1999, 'latest': 2016},
'cooper': {'earliest': 2001, 'latest': 2016},
'clk': {'earliest': 1997, 'latest': 2010},
'micra': {'earliest': 1982, 'latest': 2016},
'80': {'earliest': 1972, 'latest': 1996},
'x_reihe': {'earliest': 2000, 'latest': 2016},
'sharan': {'earliest': 1995, 'latest': 2016},
'scenic': {'earliest': 1996, 'latest': 2016},
'omega': {'earliest': 1986, 'latest': 2003},
'laguna': {'earliest': 1994, 'latest': 2016},
'civic': {'earliest': 1972, 'latest': 2016},
'1_reihe': {'earliest': 1970, 'latest': 2016},
'leon': {'earliest': 1999, 'latest': 2016},
'6_reihe': {'earliest': 2003, 'latest': 2016},
'i_reihe': {'earliest': 2004, 'latest': 2016},
'slk': {'earliest': 1996, 'latest': 2016},
'galaxy': {'earliest': 1959, 'latest': 2016},
'tt': {'earliest': 1998, 'latest': 2016},
'meriva': {'earliest': 2003, 'latest': 2016},
'yaris': {'earliest': 1999, 'latest': 2016},
'7er': {'earliest': 1977, 'latest': 2016},
'mx_reihe': {'earliest': 1989, 'latest': 2016},
'kangoo': {'earliest': 1997, 'latest': 2016},
'm_klasse': {'earliest': 1997, 'latest': 2016},
'500': {'earliest': 1957, 'latest': 2016},
'escort': {'earliest': 1968, 'latest': 2000},
'arosa': {'earliest': 1997, 'latest': 2005},
'one': {'earliest': 2001, 'latest': 2016},
's_klasse': {'earliest': 1972, 'latest': 2016},
'vito': {'earliest': 1996, 'latest': 2016},
'b_klasse': {'earliest': 2005, 'latest': 2016},
'bora': {'earliest': 1998, 'latest': 2005},
'berlingo': {'earliest': 1996, 'latest': 2016},
'tigra': {'earliest': 1994, 'latest': 2008},
'v40': {'earliest': 1995, 'latest': 2016},
'sprinter': {'earliest': 1995, 'latest': 2016},
'transit': {'earliest': 1965, 'latest': 2016},
'fox': {'earliest': 2003, 'latest': 2016},
'z_reihe': {'earliest': 1998, 'latest': 2016},
'swift': {'earliest': 1983, 'latest': 2016},
'c_max': {'earliest': 2003, 'latest': 2016},
'corolla': {'earliest': 1966, 'latest': 2016},
'panda': {'earliest': 1980, 'latest': 2016},
'seicento': {'earliest': 1998, 'latest': 2007},
'tiguan': {'earliest': 2007, 'latest': 2016},
'insignia': {'earliest': 2008, 'latest': 2016},
'4_reihe': {'earliest': 1892, 'latest': 2016},
'v70': {'earliest': 1997, 'latest': 2016},
'156': {'earliest': 1997, 'latest': 2005},
'primera': {'earliest': 1990, 'latest': 2007},
'espace': {'earliest': 1984, 'latest': 2016},
'scirocco': {'earliest': 1974, 'latest': 2017},
'stilo': {'earliest': 2001, 'latest': 2008},
'a1': {'earliest': 2010, 'latest': 2025},
'almera': {'earliest': 1995, 'latest': 2006},
'147': {'earliest': 2000, 'latest': 2010},
'avensis': {'earliest': 1997, 'latest': 2016},
'grand': {'earliest': 1924, 'latest': 2016},
'a5': {'earliest': 2007, 'latest': 2016},
'qashqai': {'earliest': 2006, 'latest': 2016},
'a8': {'earliest': 1994, 'latest': 2016},
'eos': {'earliest': 2006, 'latest': 2016},
'c3': {'earliest': 2002, 'latest': 2016},
'navara': {'earliest': 1997, 'latest': 2016},
'c4': {'earliest': 2004, 'latest': 2016},
'kadett': {'earliest': 1937, 'latest': 1991},
'signum': {'earliest': 2003, 'latest': 2008},
'jetta': {'earliest': 1979, 'latest': 2016},
'forester': {'earliest': 1997, 'latest': 2016},
'xc_reihe': {'earliest': 2001, 'latest': 2016},
'combo': {'earliest': 1993, 'latest': 2016},
'jazz': {'earliest': 2001, 'latest': 2016},
'100': {'earliest': 1968, 'latest': 1994},
'sportage': {'earliest': 1993, 'latest': 2016},
'sorento': {'earliest': 2002, 'latest': 2016},
'mustang': {'earliest': 1964, 'latest': 2016},
'getz': {'earliest': 2002, 'latest': 2011},
'r19': {'earliest': 1988, 'latest': 1996},
'cordoba': {'earliest': 1993, 'latest': 2009},
'up': {'earliest': 2011, 'latest': 2016},
'ceed': {'earliest': 2006, 'latest': 2016},
'5_reihe': {'earliest': 1972, 'latest': 2016},
'yeti': {'earliest': 2009, 'latest': 2016},
'mii': {'earliest': 2011, 'latest': 2016},
'rx_reihe': {'earliest': 1978, 'latest': 2012},
'modus': {'earliest': 2004, 'latest': 2012},
'matiz': {'earliest': 1998, 'latest': 2016},
'c1': {'earliest': 2005, 'latest': 2016},
'rio': {'earliest': 2000, 'latest': 2016},
'logan': {'earliest': 2004, 'latest': 2016},
'spider': {'earliest': 1996, 'latest': 2006},
'cuore': {'earliest': 1977, 'latest': 2009},
's_max': {'earliest': 2006, 'latest': 2015},
'a2': {'earliest': 1999, 'latest': 2005},
'viano': {'earliest': 2003, 'latest': 2014},
'roomster': {'earliest': 2006, 'latest': 2015},
'sl': {'earliest': 1952, 'latest': 2011},
'santa': {'earliest': 1999, 'latest': 2013},
'ptcruiser':{'earliest': 2000, 'latest': 2010},
'exeo': {'earliest': 2008, 'latest': 2013},
'159': {'earliest': 2005, 'latest': 2011},
'juke': {'earliest': 2010, 'latest': 2016},
'carisma': {'earliest': 1995, 'latest': 2006},
'accord': {'earliest': 1976, 'latest': 2016},
'lanos': {'earliest': 1997, 'latest': 2009},
'phaeton': {'earliest': 2002, 'latest': 2016},
'verso': {'earliest': 2001, 'latest': 2016},
'rav': {'earliest': 1994, 'latest': 2016},
'picanto': {'earliest': 2003, 'latest': 2016},
'boxster': {'earliest': 1996, 'latest': 2016},
'kalos': {'earliest': 2002, 'latest': 2011},
'superb': {'earliest': 2001, 'latest': 2016},
'alhambra': {'earliest': 1996, 'latest': 2010},
'roadster': {'earliest': 1998, 'latest': 2016},
'ypsilon': {'earliest': 1995, 'latest': 2016},
'cayenne': {'earliest': 2002, 'latest': 2016},
'galant': {'earliest': 1969, 'latest': 2012},
'justy': {'earliest': 1984, 'latest': 2010},
'90': {'earliest': 1984, 'latest': 1987},
'sirion': {'earliest': 1995, 'latest': 2016},
'crossfire': {'earliest': 2003, 'latest': 2008},
'agila': {'earliest': 2000, 'latest': 2014},
'duster': {'earliest': 2010, 'latest': 2016},
'cr_reihe': {'earliest': 1995, 'latest': 2016},
'v50': {'earliest': 2004, 'latest': 2012},
'c_reihe': {'earliest': 1993, 'latest': 2016},
'v_klasse': {'earliest': 1996, 'latest': 2016},
'c5': {'earliest': 2001, 'latest': 2016},
'aygo': {'earliest': 2005, 'latest': 2016},
'cc': {'earliest': 2008, 'latest': 2016},
'carnival': {'earliest': 1998, 'latest': 2016},
'fusion': {'earliest': 2002, 'latest': 2016},
'911': {'earliest': 1963, 'latest': 2016},
'm_reihe': {'earliest': 1976, 'latest': 2016},
'cl': {'earliest': 1996, 'latest': 2014},
'300c': {'earliest': 2005, 'latest': 2016},
'spark': {'earliest': 1998, 'latest': 2016},
'kuga': {'earliest': 2008, 'latest': 2016},
'x_type': {'earliest': 2001, 'latest': 2009},
'ducato': {'earliest': 1981, 'latest': 2016},
's_type': {'earliest': 1998, 'latest': 2008},
'x_trail': {'earliest': 2000, 'latest': 2016},
'toledo': {'earliest': 1991, 'latest': 2013},
'altea': {'earliest': 2004, 'latest': 2015},
'voyager': {'earliest': 1984, 'latest': 2016},
'calibra': {'earliest': 1989, 'latest': 1997},
'bravo': {'earliest': 1995, 'latest': 2006},
'antara': {'earliest': 2006, 'latest': 2016},
'tucson': {'earliest': 2004, 'latest': 2016},
'citigo': {'earliest': 2011, 'latest': 2016},
'jimny': {'earliest': 1983, 'latest': 2016},
'wrangler': {'earliest': 1986, 'latest': 2016},
'lybra': {'earliest': 1998, 'latest': 2016},
'q7': {'earliest': 2005, 'latest': 2016},
'lancer': {'earliest': 1973, 'latest': 2016},
'captiva': {'earliest': 2006, 'latest': 2016},
'c2': {'earliest': 2003, 'latest': 2009},
'discovery': {'earliest': 1989, 'latest': 2016},
'freelander': {'earliest': 1997, 'latest': 2014},
'sandero': {'earliest': 2007, 'latest': 2016},
'note': {'earliest': 2004, 'latest': 2016},
'900': {'earliest': 1978, 'latest': 1993},
'cherokee': {'earliest': 1984, 'latest': 2016},
'clubman': {'earliest': 2007, 'latest': 2016},
'samara': {'earliest': 1984, 'latest': 2001},
'defender': {'earliest': 1983, 'latest': 2016},
'cx_reihe': {'earliest': 2006, 'latest': 2011},
'legacy': {'earliest': 1989, 'latest': 2016},
'pajero': {'earliest': 1982, 'latest': 2016},
'auris': {'earliest': 2006, 'latest': 2016},
'niva': {'earliest': 1977, 'latest': 2016},
's60': {'earliest': 2000, 'latest': 2016},
'nubira': {'earliest': 1997, 'latest': 2008},
'vivaro': {'earliest': 2001, 'latest': 2016},
'g_klasse': {'earliest': 1979, 'latest': 2016},
'lodgy': {'earliest': 2012, 'latest': 2016},
'850': {'earliest': 1991, 'latest': 1997},
'range_rover': {'earliest': 1970, 'latest': 2016},
'q3': {'earliest': 2011, 'latest': 2016},
'serie_2': {'earliest': 1958, 'latest': 2016},
'glk': {'earliest': 2008, 'latest': 2015},
'charade': {'earliest': 1977, 'latest': 2000},
'croma': {'earliest': 1985, 'latest': 2010},
'outlander': {'earliest': 2001, 'latest': 2016},
'doblo': {'earliest': 2000, 'latest': 2016},
'musa': {'earliest': 2004, 'latest': 2012},
'move': {'earliest': 1998, 'latest': 2002},
'9000': {'earliest': 1985, 'latest': 1998},
'v60': {'earliest': 2010, 'latest': 2016},
'145': {'earliest': 1994, 'latest': 2000},
'aveo': {'earliest': 2002, 'latest': 2011},
'200': {'earliest': 1980, 'latest': 2007},
'b_max': {'earliest': 2007, 'latest': 2012},
'range_rover_sport': {'earliest': 2005, 'latest': 2016},
'terios': {'earliest': 1997, 'latest': 2016},
'rangerover': {'earliest': 1970, 'latest': 2016},
'q5': {'earliest': 2008, 'latest': 2016},
'range_rover_evoque':{'earliest': 2011, 'latest': 2016},
'materia': {'earliest': 2007, 'latest': 2012},
'delta': {'earliest': 1979, 'latest': 2014},
'gl': {'earliest': 2006, 'latest': 2015},
'kalina': {'earliest': 2004, 'latest': 2016},
'amarok': {'earliest': 2010, 'latest': 2016},
'elefantino': {'earliest': 1963, 'latest': 2011},
'i3': {'earliest': 2013, 'latest': 2016},
'kappa': {'earliest': 1994, 'latest': 2001},
'serie_3': {'earliest': 1975, 'latest': 2016},
'serie_1': {'earliest': 2004, 'latest': 2016},
'mercedes_benz': {'earliest': 1926, 'latest': 2016},
'citroen': {'earliest': 1919, 'latest': 2016},
'fiat': {'earliest': 1899, 'latest': 2016},
'ford': {'earliest': 1903, 'latest': 2016},
'hyundai': {'earliest': 1967, 'latest': 2016},
'peugeot': {'earliest': 1889, 'latest': 2016},
'opel': {'earliest': 1899, 'latest': 2016},
'suzuki': {'earliest': 1955, 'latest': 2016},
'audi': {'earliest': 1910, 'latest': 2016},
'mazda': {'earliest': 1931, 'latest': 2016},
'renault': {'earliest': 1898, 'latest': 2016},
'chevrolet': {'earliest': 1911, 'latest': 2016},
'toyota': {'earliest': 1936, 'latest': 2016},
'mitsubishi': {'earliest': 1917, 'latest': 2016},
'volkswagen': {'earliest': 1937, 'latest': 2016},
'nissan': {'earliest': 1933, 'latest': 2016},
'volvo': {'earliest': 1927, 'latest': 2016},
'alfa_romeo': {'earliest': 1910, 'latest': 2016},
'kia': {'earliest': 1944, 'latest': 2016},
'rover': {'earliest': 1904, 'latest': 2005},
'chrysler': {'earliest': 1925, 'latest': 2016},
'saab': {'earliest': 1947, 'latest': 2011},
'honda': {'earliest': 1963, 'latest': 2016},
'skoda': {'earliest': 1905, 'latest': 2016},
'bmw': {'earliest': 1928, 'latest': 2016},
'jaguar': {'earliest': 1935, 'latest': 2016},
'porsche': {'earliest': 1948, 'latest': 2016},
'jeep': {'earliest': 1941, 'latest': 2016},
'seat': {'earliest': 1950, 'latest': 2016},
'daihatsu': {'earliest': 1951, 'latest': 2016},
'lancia': {'earliest': 1908, 'latest': 2016},
'mini': {'earliest': 1959, 'latest': 2016},
'daewoo': {'earliest': 1937, 'latest': 2011},
'trabant': {'earliest': 1957, 'latest': 1991},
'smart': {'earliest': 1998, 'latest': 2016},
'subaru': {'earliest': 1954, 'latest': 2016},
'lada': {'earliest': 1966, 'latest': 2016},
'dacia': {'earliest': 1966, 'latest': 2016},
'land_rover': {'earliest': 1948, 'latest': 2016},
'other': {'earliest': 1893, 'latest': 2016}
}
registration_cols = list(registration_years.keys())
for registration in registration_cols:
    earliest = registration_years[registration]['earliest']
    latest = registration_years[registration]['latest']
    # Too early
    early_reg = (df['model'] == registration) & (df['registrationyear'] < earliest)
    df.loc[early_reg, 'registration_correction'] = "Y: too early"
    # Too late
    late_reg = (df['model'] == registration) & (df['registrationyear'] > latest)
    df.loc[late_reg, 'registration_correction'] = "Y: too late"
    # Acceptable range
    acc_reg = (df['model'] == registration) & (df['registrationyear'] >= earliest) & (df['registrationyear'] <= latest)
    df.loc[acc_reg, 'registration_correction'] = 'N'
df['registration_correction'].isna().sum()
19705
del registration_years, registration_cols
gc.collect()
2731
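The flagging loop builds three full-length masks for every dictionary key. One possible alternative, sketched here on hypothetical data, maps each row's model to its bounds once and then flags all rows in a single pass. The `toy` frame and the two-entry dictionary are assumptions for illustration; the flag labels match those used above.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the registration_years dictionary
registration_years = {
    "golf": {"earliest": 1974, "latest": 2016},
    "lupo": {"earliest": 1998, "latest": 2005},
}

toy = pd.DataFrame({
    "model": ["golf", "golf", "lupo", np.nan],
    "registrationyear": [1910, 2000, 2010, 1995],
})

# Turn the nested dict into a lookup table indexed by model name
bounds = pd.DataFrame.from_dict(registration_years, orient="index")
earliest = toy["model"].map(bounds["earliest"])
latest = toy["model"].map(bounds["latest"])

# Rows whose model is missing or not in the lookup keep a missing flag
year = toy["registrationyear"]
flag = pd.Series(pd.NA, index=toy.index, dtype="object")
flag[year.between(earliest, latest)] = "N"
flag[year < earliest] = "Y: too early"
flag[year > latest] = "Y: too late"
toy["registration_correction"] = flag
print(toy)
```

As in the loop, rows with a missing model stay unflagged, which is consistent with the count of missing flags matching the number of rows without a model.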
# Count rows by brand where the new `registration_correction` column is NaN
display(df[df['registration_correction'].isna()].value_counts(subset = 'brand'))
# List the models whose registration years were flagged as too early
display(df[df['registration_correction'] == "Y: too early"].value_counts(subset = 'model').index.to_list())
# Summarize what still needs to be fixed
df['registration_correction'].value_counts(dropna = False)
brand
volkswagen        3505
sonstige_autos    3374
bmw               1935
opel              1871
audi              1283
mercedes_benz     1210
ford              1013
peugeot            891
renault            735
fiat               502
mazda              352
smart              268
citroen            249
seat               235
hyundai            228
nissan             207
mitsubishi         160
toyota             157
honda              155
skoda              141
suzuki             137
alfa_romeo         137
kia                113
chevrolet          110
volvo              109
trabant             95
chrysler            89
rover               82
mini                62
daewoo              46
subaru              44
porsche             42
daihatsu            39
jeep                26
lancia              23
dacia               22
saab                16
lada                15
jaguar              14
land_rover          13
dtype: int64
['e_klasse', '6_reihe', 'z_reihe', 'spider', 'cr_reihe', 'c_klasse', '1er', 'cooper', 'x_type', 'forfour', 's_klasse', 'sprinter', 'seicento', '300c', 'golf', 'antara', 'a6', 'cl', 'move', 'b_klasse', 'zafira', 'astra', 'clio', 'i3', 'fox', '601', 'a4', 'kuga', 'punto', 'lupo', '3er', 'twingo', 'v60', 'ka', 'touran', 'focus', 'corsa', 'vivaro', 'polo', 'mondeo', 'glk', 'a3', 'a_klasse', 'c3', 'cx_reihe', 'signum', 'arosa', 'up', 'verso', 'other', '156', 'a5', '911', 'insignia', '500', 'scenic', 'fusion', 'agila', 'beetle', 'sorento', 'kangoo', 'fabia', 'c4', 'defender', '3_reihe', 'tucson', '2_reihe', 'v70', 'ypsilon', 'xc_reihe', 'viano', 'meriva', 'modus', 'a2', 'octavia', 'wrangler', '159', 'tigra', 'passat', 'a1', 'touareg', 'range_rover_evoque', 'vectra', '7er', 'serie_1', 'transporter', 'materia', 'c5', 'fortwo', 'leon', 'boxster', 'caddy', 'laguna', 'combo', '145', 'impreza', 'tt', 'c2', 'cordoba', 'logan', 'c1', '1_reihe', 'transit', 'bravo', 'v50', 'clubman', 'colt', 'stilo', 'clk', 'cherokee', 'vito', 'cc', 'cayenne', 'x_reihe', 'carnival', 'calibra', '147', 'escort', 'spark', 'espace', 'altea', 'lanos', 'c_max', 'kalos', 'omega', 'jazz', 'pajero', 'picanto', 'i_reihe', 'alhambra', 'q3', 'range_rover_sport', '6er', 'rio', 'roomster', '850', 'santa', 'gl', 'g_klasse', 'serie_3', 'slk', 'berlingo', 'megane', '100']
N               318683
NaN              19705
Y: too late      14031
Y: too early      1950
Name: registration_correction, dtype: int64
# Only consider “reasonable” registration years
reasonable_mask = df['registrationyear'].between(1885, 2016)
df_valid = df[reasonable_mask]
# Group by brand AND model since peugeot and mazda both have a model named 1_reihe - others may too
# Need the count to make sure there are enough observations to pull from
model_stats = df_valid.groupby(['brand', 'model'])['registrationyear'].agg(['min','count']).reset_index()
# Only trust models with at least 100 observations
min_count = 100
trusted_models = model_stats[model_stats['count'] >= min_count]
# Merge min_year for each brand+model
df = df.merge(
trusted_models[['brand','model','min']],
on=['brand','model'],
how='left'
)
# Update registrationyears by using the Y: too early query from registration_correction column
mask = (df['registration_correction'] == "Y: too early") & (df['registrationyear'] < df['min'])
# Replace registrationyear
df.loc[mask, 'registrationyear'] = df.loc[mask, 'min']
# Mark as corrected
df.loc[mask, 'registration_correction'] = "N"
# Ensure marked correctly
df[mask]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16062 | 29/03/2016 23:42 | 190 | NaN | 1910.0 | NaN | 0 | mondeo | 5000 | 0 | NaN | ford | NaN | 2016-03-29 | 0 | 47166 | 06/04/2016 10:44 | N | 1910.0 |
| 18224 | 09/03/2016 17:49 | 7999 | NaN | 1980.0 | manual | 224 | impreza | 5000 | 3 | NaN | subaru | NaN | 2016-09-03 | 0 | 53577 | 15/03/2016 05:15 | N | 1980.0 |
| 53577 | 20/03/2016 11:44 | 330 | NaN | 1910.0 | NaN | 0 | polo | 5000 | 0 | NaN | volkswagen | NaN | 2016-03-20 | 0 | 45259 | 04/04/2016 08:17 | N | 1910.0 |
| 56241 | 30/03/2016 18:54 | 950 | NaN | 1910.0 | manual | 1400 | golf | 125000 | 4 | petrol | volkswagen | NaN | 2016-03-30 | 0 | 50389 | 03/04/2016 09:45 | N | 1910.0 |
| 66621 | 07/03/2016 15:39 | 0 | NaN | 1910.0 | auto | 1200 | punto | 150000 | 0 | petrol | fiat | NaN | 2016-07-03 | 0 | 78333 | 10/03/2016 14:17 | N | 1910.0 |
| 67167 | 01/04/2016 08:55 | 0 | NaN | 1998.0 | NaN | 0 | kuga | 5000 | 7 | gasoline | ford | NaN | 2016-01-04 | 0 | 98693 | 07/04/2016 05:44 | N | 1998.0 |
| 71459 | 27/03/2016 23:46 | 500 | NaN | 1910.0 | NaN | 0 | golf | 5000 | 0 | NaN | volkswagen | NaN | 2016-03-27 | 0 | 91628 | 29/03/2016 13:46 | N | 1910.0 |
| 79120 | 15/03/2016 18:47 | 4750 | NaN | 1954.0 | NaN | 0 | other | 5000 | 0 | NaN | renault | NaN | 2016-03-15 | 0 | 66706 | 06/04/2016 20:19 | N | 1954.0 |
| 104698 | 27/03/2016 13:52 | 100 | NaN | 1990.0 | NaN | 0 | 156 | 5000 | 0 | petrol | alfa_romeo | NaN | 2016-03-27 | 0 | 21680 | 07/04/2016 09:46 | N | 1990.0 |
| 119442 | 18/03/2016 10:37 | 1 | NaN | 1910.0 | NaN | 1000 | 3er | 5000 | 0 | NaN | bmw | NaN | 2016-03-18 | 0 | 94086 | 05/04/2016 22:16 | N | 1910.0 |
| 125577 | 15/03/2016 18:38 | 7750 | NaN | 1910.0 | NaN | 0 | other | 5000 | 0 | NaN | citroen | NaN | 2016-03-15 | 0 | 66706 | 06/04/2016 18:47 | N | 1910.0 |
| 129768 | 05/03/2016 17:55 | 275 | NaN | 1971.0 | NaN | 0 | e_klasse | 5000 | 0 | NaN | mercedes_benz | NaN | 2016-05-03 | 0 | 12627 | 05/03/2016 17:55 | N | 1971.0 |
| 143621 | 17/03/2016 23:40 | 550 | NaN | 1910.0 | NaN | 1000 | golf | 5000 | 6 | petrol | volkswagen | NaN | 2016-03-17 | 0 | 91732 | 26/03/2016 05:18 | N | 1910.0 |
| 164397 | 29/03/2016 17:49 | 0 | NaN | 1910.0 | NaN | 0 | transporter | 5000 | 1 | NaN | volkswagen | NaN | 2016-03-29 | 0 | 29351 | 06/04/2016 03:45 | N | 1910.0 |
| 174893 | 05/03/2016 19:48 | 0 | NaN | 1910.0 | NaN | 1000 | golf | 5000 | 4 | petrol | volkswagen | NaN | 2016-05-03 | 0 | 35716 | 05/03/2016 22:27 | N | 1910.0 |
| 189722 | 29/03/2016 16:56 | 1500 | NaN | 1910.0 | manual | 0 | beetle | 5000 | 0 | petrol | volkswagen | NaN | 2016-03-29 | 0 | 48324 | 31/03/2016 10:15 | N | 1910.0 |
| 192705 | 31/03/2016 15:47 | 20 | NaN | 1990.0 | NaN | 0 | 156 | 5000 | 0 | NaN | alfa_romeo | NaN | 2016-03-31 | 0 | 31224 | 06/04/2016 08:46 | N | 1990.0 |
| 195855 | 28/03/2016 23:40 | 1 | NaN | 1990.0 | NaN | 0 | zafira | 5000 | 0 | NaN | opel | NaN | 2016-03-28 | 0 | 50171 | 05/04/2016 03:44 | N | 1990.0 |
| 213499 | 08/03/2016 12:06 | 380 | NaN | 1976.0 | NaN | 0 | 6er | 5000 | 0 | NaN | bmw | NaN | 2016-08-03 | 0 | 35102 | 06/04/2016 00:16 | N | 1976.0 |
| 216770 | 02/04/2016 14:39 | 60 | NaN | 1910.0 | NaN | 0 | corsa | 5000 | 0 | NaN | opel | NaN | 2016-02-04 | 0 | 41844 | 02/04/2016 14:39 | N | 1910.0 |
| 218241 | 16/03/2016 12:46 | 7999 | NaN | 1910.0 | NaN | 290 | golf | 5000 | 6 | NaN | volkswagen | NaN | 2016-03-16 | 0 | 15827 | 29/03/2016 20:47 | N | 1910.0 |
| 252420 | 27/03/2016 16:39 | 149 | NaN | 1987.0 | NaN | 0 | 1_reihe | 5000 | 0 | NaN | peugeot | NaN | 2016-03-27 | 0 | 33605 | 05/04/2016 11:47 | N | 1987.0 |
| 256532 | 05/03/2016 17:44 | 12500 | NaN | 1910.0 | NaN | 200 | golf | 5000 | 0 | NaN | volkswagen | NaN | 2016-02-28 | 0 | 75378 | 07/04/2016 12:17 | N | 1910.0 |
| 275472 | 05/04/2016 11:39 | 530 | NaN | 1980.0 | NaN | 0 | 300c | 5000 | 0 | NaN | chrysler | NaN | 2016-05-04 | 0 | 52152 | 05/04/2016 11:39 | N | 1980.0 |
| 286928 | 18/03/2016 16:51 | 1 | NaN | 1992.0 | NaN | 174 | touareg | 5000 | 3 | gasoline | volkswagen | NaN | 2016-03-18 | 0 | 97616 | 05/04/2016 22:44 | N | 1992.0 |
| 287304 | 09/03/2016 15:54 | 500 | NaN | 1929.0 | manual | 0 | other | 5000 | 0 | NaN | bmw | yes | 2016-09-03 | 0 | 30900 | 10/03/2016 12:17 | N | 1929.0 |
| 294028 | 28/03/2016 23:45 | 0 | NaN | 1960.0 | NaN | 0 | 601 | 5000 | 0 | NaN | trabant | NaN | 2016-03-28 | 0 | 6712 | 30/03/2016 16:45 | N | 1960.0 |
| 319412 | 25/03/2016 18:58 | 480 | NaN | 1945.0 | NaN | 0 | astra | 5000 | 0 | NaN | opel | NaN | 2016-03-25 | 0 | 96160 | 07/04/2016 01:44 | N | 1945.0 |
| 340759 | 04/04/2016 23:55 | 700 | NaN | 1989.0 | manual | 1600 | a3 | 150000 | 4 | petrol | audi | no | 2016-04-04 | 0 | 86343 | 05/04/2016 06:44 | N | 1989.0 |
| 351682 | 12/03/2016 00:57 | 11500 | NaN | 1910.0 | NaN | 16 | other | 5000 | 6 | petrol | fiat | NaN | 2016-11-03 | 0 | 16515 | 05/04/2016 19:47 | N | 1910.0 |
# Remove added column
df = df.drop(columns=['min'])
# Filled in 30 rows for the Y: too early column
df['registration_correction'].value_counts(dropna = False)
N               318713
NaN              19705
Y: too late      14031
Y: too early      1920
Name: registration_correction, dtype: int64
# Do the same thing for the Y: too late rows
reasonable_mask = df['registrationyear'].between(1885, 2016)
df_valid = df[reasonable_mask]
model_stats = df_valid.groupby(['brand', 'model'])['registrationyear'].agg(['max','count']).reset_index()
min_count = 100
trusted_models = model_stats[model_stats['count'] >= min_count]
df = df.merge(
trusted_models[['brand','model','max']],
on=['brand','model'],
how='left'
)
mask = (df['registration_correction'] == "Y: too late") & (df['registrationyear'] > df['max'])
df.loc[mask, 'registrationyear'] = df.loc[mask, 'max']
df.loc[mask, 'registration_correction'] = "N"
display(df[mask])
df = df.drop(columns=['max'])
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | 23/03/2016 14:52 | 2900 | NaN | 2016.0 | manual | 90 | meriva | 150000 | 5 | petrol | opel | no | 2016-03-23 | 0 | 49716 | 31/03/2016 01:16 | N | 2016.0 |
| 26 | 10/03/2016 19:38 | 5555 | NaN | 2016.0 | manual | 125 | c4 | 125000 | 4 | NaN | citroen | no | 2016-10-03 | 0 | 31139 | 16/03/2016 09:16 | N | 2016.0 |
| 48 | 25/03/2016 14:40 | 7750 | NaN | 2016.0 | manual | 80 | golf | 100000 | 1 | petrol | volkswagen | NaN | 2016-03-25 | 0 | 48499 | 31/03/2016 21:47 | N | 2016.0 |
| 51 | 07/03/2016 18:57 | 2000 | NaN | 2016.0 | manual | 90 | punto | 150000 | 11 | gasoline | fiat | yes | 2016-07-03 | 0 | 66115 | 07/03/2016 18:57 | N | 2016.0 |
| 57 | 10/03/2016 20:53 | 2399 | NaN | 2016.0 | manual | 64 | other | 125000 | 3 | NaN | seat | no | 2016-10-03 | 0 | 33397 | 25/03/2016 10:17 | N | 2016.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354112 | 11/03/2016 15:49 | 3600 | NaN | 2016.0 | manual | 86 | transit | 150000 | 5 | gasoline | ford | NaN | 2016-11-03 | 0 | 32339 | 12/03/2016 05:45 | N | 2016.0 |
| 354140 | 29/03/2016 16:47 | 1000 | NaN | 2016.0 | manual | 101 | a4 | 150000 | 9 | NaN | audi | NaN | 2016-03-29 | 0 | 38315 | 06/04/2016 02:44 | N | 2016.0 |
| 354203 | 17/03/2016 00:56 | 2140 | NaN | 2016.0 | manual | 80 | fiesta | 150000 | 6 | NaN | ford | no | 2016-03-17 | 0 | 44866 | 29/03/2016 15:45 | N | 2016.0 |
| 354253 | 25/03/2016 09:37 | 1250 | NaN | 2016.0 | NaN | 0 | corsa | 150000 | 0 | petrol | opel | NaN | 2016-03-25 | 0 | 45527 | 06/04/2016 07:46 | N | 2016.0 |
| 354289 | 05/03/2016 14:55 | 5000 | NaN | 2016.0 | manual | 120 | other | 150000 | 7 | NaN | citroen | yes | 2016-05-03 | 0 | 15518 | 05/04/2016 11:48 | N | 2016.0 |
12373 rows × 18 columns
# Removed 12,373 from Y: too late
display(df['registration_correction'].value_counts(dropna = False))
del reasonable_mask, df_valid, model_stats, mask, min_count, trusted_models
gc.collect()
N               331086
NaN              19705
Y: too early      1920
Y: too late       1658
Name: registration_correction, dtype: int64
0
Total registrationyear rows fixed: 12,403
Total rows: 354,369
Percent Fixed: 3.5%
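The two correction passes above differ only in the aggregate (`min` vs. `max`) and the direction of the comparison, so they could plausibly be folded into one helper. The following is a minimal sketch under that assumption, not the notebook's actual code: the function name `clamp_years_to_model_range` and its parameters are hypothetical, and unlike the cells above it clamps every out-of-range year for a trusted brand+model instead of consulting the `registration_correction` flag.

```python
import pandas as pd


def clamp_years_to_model_range(df, min_count=100, lo=1885, hi=2016):
    """Clamp registrationyear into the range observed for each trusted brand+model.

    Hypothetical helper: a brand+model pair is "trusted" when it has at least
    `min_count` rows whose year already lies in [lo, hi].
    """
    valid = df[df["registrationyear"].between(lo, hi)]
    stats = (
        valid.groupby(["brand", "model"])["registrationyear"]
        .agg(year_min="min", year_max="max", n="count")
        .reset_index()
    )
    trusted = stats[stats["n"] >= min_count].drop(columns="n")

    # Left merge keeps the row order and row count of df (right keys are unique).
    merged = df.merge(trusted, on=["brand", "model"], how="left")
    merged.index = df.index

    has_range = merged["year_min"].notna()
    clamped = merged["registrationyear"].clip(
        lower=merged["year_min"], upper=merged["year_max"]
    )

    out = df.copy()
    # Keep the original year where no trusted range exists, else use the clamped year.
    out["registrationyear"] = out["registrationyear"].where(~has_range, clamped)
    return out
```

Rows whose brand+model pair is not trusted are returned unchanged, which mirrors the rows that remain flagged `Y: too early` or `Y: too late` after the passes above.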
Duplicate Handling¶
display(df.duplicated().sum())
df[df.duplicated()]
df = df.drop_duplicates()
df.duplicated().sum()
262
0
Missing Values¶
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 354107 entries, 0 to 354368 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 datecrawled 354107 non-null object 1 price 354107 non-null int64 2 vehicletype 316623 non-null object 3 registrationyear 354075 non-null float64 4 gearbox 334277 non-null object 5 power 354107 non-null int64 6 model 334406 non-null object 7 mileage 354107 non-null int64 8 registrationmonth 354107 non-null int64 9 fueltype 321218 non-null object 10 brand 354107 non-null object 11 notrepaired 282962 non-null object 12 datecreated 354107 non-null datetime64[ns] 13 numberofpictures 354107 non-null int64 14 postalcode 354107 non-null int64 15 lastseen 354107 non-null object 16 registration_correction 334406 non-null object dtypes: datetime64[ns](1), float64(1), int64(6), object(9) memory usage: 48.6+ MB
Missing Values:
| Column | Percent Missing |
|---|---|
| VehicleType | 10.586 % |
| RegistrationYear | 6.575 % |
| Gearbox | 5.600 % |
| Model | 5.564 % |
| FuelType | 9.288 % |
| NotRepaired | 20.091 % |
# Percent Missing
print("Percent Missing")
print("===============")
vt = 354107 - 316623
vtp = (vt/354107) * 100
print(f"Vehicle Type: \n{vtp:.3f} %")
print("")
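# Hard-coded: rows whose registration_correction is NaN or still flagged
# (19705 + 1920 + 1658 = 23283), read from the value_counts output above
ry = 23283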
ryp = (ry/354107) * 100
print(f"Registration Year: \n{ryp:.3f} %")
print("")
gb = 354107 - 334277
gbp = (gb/354107) * 100
print(f"GearBox: \n{gbp:.3f} %")
print("")
m = 354107 - 334406
mp = (m/354107) * 100
print(f"Model: \n{mp:.3f} %")
print("")
ft = 354107 - 321218
ftp = (ft/354107) * 100
print(f"FuelType: \n{ftp:.3f} %")
print("")
nr = 354107 - 282962
nrp = (nr/354107) * 100
print(f"NotRepaired: \n{nrp:.3f} %")
del vt, vtp, ry, ryp, gb, gbp, m, mp, ft, ftp, nr, nrp
gc.collect()
Percent Missing
===============
Vehicle Type: 
10.586 %

Registration Year: 
6.575 %

GearBox: 
5.600 %

Model: 
5.564 %

FuelType: 
9.288 %

NotRepaired: 
20.091 %
0
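The percentages above are computed from counts typed in by hand from `df.info()`, which go stale as soon as rows are dropped or filled. As a hedged alternative, the same figures can be derived directly from the frame. The helper name `percent_missing` is hypothetical, and this sketch only counts true `NaN` values, so it would not reproduce the Registration Year figure, which is based on flagged rows rather than missing ones.

```python
import pandas as pd


def percent_missing(frame):
    """Return the percentage of NaN values per column, largest first."""
    # isna() gives a boolean frame; the column mean of booleans is the NaN share.
    return (frame.isna().mean() * 100).round(3).sort_values(ascending=False)
```

Called as `percent_missing(df)`, this returns one value per column, so columns without missing values simply show `0.0`.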
# Inspect Model Column
model_col = df[(df['model'].isna()) & (df['brand'] != 'sonstige_autos')]
model_col_p0 = model_col[model_col['power'] == 0]
brand_p0 = model_col_p0['brand'].value_counts()
brand_p0.plot(kind='bar', x='brand', y='power', figsize=(12,6))
plt.title('0hp powered vehicles by brand (model info. missing)')
plt.xlabel('Brand')
plt.ylabel('0hp Power Frequency')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
display(brand_p0)
display(model_col_p0['brand'].unique())
brand_p0_rows_bottom = ['mitsubishi', 'skoda', 'chevrolet', 'kia', 'porsche', 'chrysler', 'volvo', 'rover', 'daihatsu',
'daewoo', 'subaru', 'mini', 'lada', 'dacia', 'jeep', 'jaguar', 'lancia', 'saab', 'land_rover']
brand_p0_rows_mbottom = ['citroen', 'seat', 'hyundai', 'nissan', 'trabant', 'suzuki', 'toyota',
'alfa_romeo', 'honda']
brand_p0_rows_middle = ['ford', 'audi', 'peugeot', 'renault', 'fiat', 'mazda', 'smart']
brand_p0_rows_top = ['bmw', 'opel', 'mercedes_benz']
brand_p0_rows_vw = ['volkswagen']
# Separate by brand and known model
model_col_notna_top_bottom = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
& (df['brand'].isin(brand_p0_rows_bottom))]
model_col_notna_top_mbottom = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
& (df['brand'].isin(brand_p0_rows_mbottom))]
model_col_notna_top_middle = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
& (df['brand'].isin(brand_p0_rows_middle))]
model_col_notna_top = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
& (df['brand'].isin(brand_p0_rows_top))]
model_col_notna_top_vw = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
& (df['brand'].isin(brand_p0_rows_vw))]
# Use known model to find NaN values
# Least missing
brand_p0_notna_top_bottom = model_col_notna_top_bottom[['brand','model']].value_counts().sort_index()
# Middle Least missing
brand_p0_notna_top_mbottom = model_col_notna_top_mbottom[['brand','model']].value_counts().sort_index()
# Middle missing
brand_p0_notna_top_middle = model_col_notna_top_middle[['brand','model']].value_counts().sort_index()
# Top Missing
brand_p0_notna_top = model_col_notna_top[['brand','model']].value_counts().sort_index()
# Volkswagen
brand_p0_notna_top_vw = model_col_notna_top_vw[['brand','model']].value_counts()
# Least Missing
with pd.option_context('display.max_rows', None):
    display(brand_p0_notna_top_bottom)
brand_p0_notna_top_bottom.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0hp powered vehicles by brand: Least Missing/ Bottom Tier (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
# Next Least Missing
display(brand_p0_notna_top_mbottom)
brand_p0_notna_top_mbottom.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0hp powered vehicles by brand: Least Missing Ext/ Bottom Tier 2 (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
# Middle Missing
with pd.option_context('display.max_rows', None):
    display(brand_p0_notna_top_middle)
brand_p0_notna_top_middle.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0hp powered vehicles by brand: Middle Missing (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
# Top Missing
display(brand_p0_notna_top)
brand_p0_notna_top.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0hp powered vehicles by brand: Top Missing (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
# Volkswagen
display(brand_p0_notna_top_vw)
brand_p0_notna_top_vw.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0hp powered vehicles by brand: Volkswagen (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
volkswagen 990 bmw 536 opel 512 mercedes_benz 429 ford 341 audi 325 peugeot 272 renault 254 fiat 182 mazda 114 smart 100 citroen 78 seat 73 hyundai 68 nissan 63 trabant 58 suzuki 50 toyota 47 alfa_romeo 42 honda 40 skoda 38 mitsubishi 38 chevrolet 34 kia 27 volvo 25 chrysler 25 porsche 25 rover 23 daihatsu 19 daewoo 16 mini 12 subaru 12 dacia 7 lada 7 jeep 6 jaguar 6 lancia 5 saab 3 land_rover 2 Name: brand, dtype: int64
array(['volkswagen', 'renault', 'mitsubishi', 'bmw', 'peugeot', 'audi',
'volvo', 'chevrolet', 'trabant', 'opel', 'smart', 'nissan',
'suzuki', 'mercedes_benz', 'mazda', 'seat', 'fiat', 'citroen',
'ford', 'skoda', 'kia', 'chrysler', 'daewoo', 'alfa_romeo',
'rover', 'porsche', 'dacia', 'honda', 'lada', 'subaru', 'hyundai',
'toyota', 'mini', 'jaguar', 'daihatsu', 'saab', 'land_rover',
'lancia', 'jeep'], dtype=object)
brand model
chevrolet aveo 8
captiva 8
matiz 41
other 129
spark 5
chrysler 300c 17
crossfire 6
grand 8
other 45
ptcruiser 34
voyager 65
dacia duster 13
lodgy 5
logan 33
other 1
sandero 12
daewoo kalos 16
lanos 18
matiz 39
nubira 7
other 12
daihatsu charade 6
cuore 77
materia 1
move 14
other 15
sirion 13
terios 3
jaguar other 17
s_type 10
x_type 24
jeep cherokee 29
grand 17
other 12
wrangler 12
kia carnival 52
ceed 9
other 63
picanto 29
rio 33
sorento 30
sportage 17
lada kalina 4
niva 26
other 17
samara 8
lancia delta 4
elefantino 1
kappa 2
lybra 11
musa 2
other 11
ypsilon 21
land_rover defender 13
discovery 9
freelander 27
other 4
range_rover 7
range_rover_sport 2
serie_1 2
serie_2 2
serie_3 1
mini clubman 6
cooper 63
one 27
other 14
mitsubishi carisma 56
colt 83
galant 32
lancer 35
other 92
outlander 10
pajero 21
porsche 911 28
boxster 15
cayenne 8
other 37
rover freelander 1
other 57
rangerover 1
saab 900 15
9000 2
other 12
skoda citigo 2
fabia 142
octavia 141
other 51
roomster 11
superb 10
yeti 1
subaru forester 9
impreza 21
justy 18
legacy 14
other 4
volvo 850 25
c_reihe 6
other 55
s60 1
v40 97
v50 8
v60 2
v70 47
xc_reihe 7
dtype: int64
brand model
alfa_romeo 145 11
147 39
156 58
159 10
other 39
spider 23
citroen berlingo 94
c1 32
c2 37
c3 48
c4 27
c5 41
other 285
honda accord 30
civic 130
cr_reihe 14
jazz 19
other 38
hyundai getz 60
i_reihe 56
other 140
santa 18
tucson 11
nissan almera 73
juke 4
micra 301
navara 14
note 7
other 79
primera 86
qashqai 26
x_trail 20
seat alhambra 27
altea 15
arosa 127
cordoba 60
ibiza 206
leon 33
other 31
toledo 43
suzuki grand 9
jimny 19
other 124
swift 69
toyota auris 15
avensis 35
aygo 41
corolla 82
other 94
rav 26
verso 16
yaris 90
trabant 601 165
other 38
dtype: int64
brand model
audi 100 33
200 2
80 212
90 15
a1 12
a2 39
a3 490
a4 729
a5 17
a6 363
a8 44
other 46
q3 1
q5 2
q7 17
tt 32
fiat 500 57
bravo 38
croma 7
doblo 32
ducato 98
other 244
panda 86
punto 518
seicento 129
stilo 71
ford b_max 1
c_max 36
escort 143
fiesta 726
focus 537
fusion 22
galaxy 148
ka 536
kuga 11
mondeo 433
mustang 32
other 203
s_max 5
transit 105
mazda 1_reihe 17
3_reihe 137
5_reihe 21
6_reihe 124
cx_reihe 4
mx_reihe 62
other 146
rx_reihe 12
peugeot 1_reihe 132
2_reihe 322
3_reihe 208
4_reihe 56
5_reihe 7
other 184
renault clio 523
espace 98
kangoo 181
laguna 181
megane 399
modus 38
other 104
r19 22
scenic 216
twingo 955
smart forfour 30
fortwo 423
other 24
roadster 12
dtype: int64
brand model
bmw 1er 123
3er 1534
5er 499
6er 13
7er 79
i3 3
m_reihe 17
other 85
x_reihe 125
z_reihe 22
mercedes_benz a_klasse 539
b_klasse 54
c_klasse 744
cl 21
clk 138
e_klasse 638
g_klasse 14
gl 1
glk 4
m_klasse 71
other 326
s_klasse 114
sl 49
slk 76
sprinter 132
v_klasse 24
viano 31
vito 127
opel agila 70
antara 13
astra 1111
calibra 22
combo 57
corsa 1770
insignia 18
kadett 74
meriva 61
omega 163
other 152
signum 29
tigra 72
vectra 519
vivaro 28
zafira 353
dtype: int64
brand model
volkswagen golf 2460
polo 1611
passat 883
transporter 496
touran 361
lupo 332
sharan 230
caddy 190
beetle 166
other 137
fox 75
bora 62
touareg 58
jetta 47
scirocco 27
tiguan 24
phaeton 20
eos 8
cc 8
up 4
amarok 1
dtype: int64
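The five per-tier tables above come from five near-identical filter and `value_counts` statements. A possible way to express them once is a dictionary of tiers and a single loop. This is only a sketch: the function name `zero_power_model_counts` and the shape of the `tiers` argument are assumptions, not part of the notebook.

```python
import pandas as pd


def zero_power_model_counts(frame, tiers):
    """Count brand+model pairs among 0 hp rows, once per named brand tier.

    `tiers` maps a tier name to the list of brands in that tier (hypothetical layout).
    """
    base = frame[
        frame["model"].notna()
        & (frame["brand"] != "sonstige_autos")
        & (frame["power"] == 0)
    ]
    return {
        name: base[base["brand"].isin(brands)][["brand", "model"]]
        .value_counts()
        .sort_index()
        for name, brands in tiers.items()
    }
```

Each value is a Series indexed by `(brand, model)`, so the known models for a tier can be read from `counts[name].index.get_level_values("model")` instead of being typed out by hand.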
model_col_notna_top_vw['model'].unique()
vw0 = ['golf', 'polo', 'passat', 'transporter', 'touran', 'lupo', 'sharan',
'caddy', 'beetle', 'fox', 'bora', 'touareg', 'jetta',
'scirocco', 'tiguan', 'phaeton', 'eos', 'cc', 'up', 'amarok']
model_col_notna_top['model'].unique()
bmw0 = ['3er', '5er', 'x_reihe', '1er', '7er', 'z_reihe', 'm_reihe', '6er', 'i3']
merc0 = ['c_klasse', 'e_klasse', 'a_klasse', 'clk', 'sprinter', 'vito', 's_klasse', 'slk', 'm_klasse',
'b_klasse', 'sl', 'viano', 'v_klasse', 'cl', 'g_klasse', 'glk', 'gl']
opel0 = ['corsa', 'astra', 'vectra', 'zafira', 'omega', 'kadett', 'tigra', 'agila', 'meriva', 'combo',
'signum', 'vivaro', 'calibra', 'insignia', 'antara']
model_col_notna_top_middle['model'].unique()
audi0 = ['a4', 'a3', 'a6', '80', 'a8', 'a2','100', 'tt', 'a5', 'q7', '90', 'a1', '200', 'q5', 'q3']
fiat0 = ['punto', 'seicento', 'ducato', 'panda', 'stilo', '500', 'bravo', 'doblo', 'croma']
ford0 = ['fiesta', 'focus', 'ka', 'mondeo', 'galaxy', 'escort', 'transit', 'c_max', 'mustang',
'fusion', 'kuga', 's_max', 'b_max']
mazda0 = ['3_reihe', '6_reihe', 'mx_reihe', '5_reihe', '1_reihe', 'rx_reihe', 'cx_reihe']
peu0 = ['2_reihe', '3_reihe', '1_reihe', '4_reihe', '5_reihe']
ren0 = ['twingo', 'clio', 'megane', 'scenic', 'kangoo', 'laguna', 'espace', 'modus', 'r19']
smart0 = ['fortwo', 'forfour', 'roadster']
vw_model = df[(df['brand'] == 'volkswagen') & (df['power'] == 0) & (df['model'].isin(vw0))]
pivot_table_min_vw = pd.pivot_table(vw_model, index = 'model', columns = 'vehicletype', values = 'registrationyear', aggfunc = ('min'))
pivot_table_min_vw.plot(kind = 'bar', figsize = (12,8))
plt.title("Volkswagen Models with 0hp Engines")
plt.ylim(1892, 2026)
plt.grid(True)
plt.tight_layout()
plt.show()
del model_col,model_col_p0,brand_p0,brand_p0_rows_bottom,brand_p0_rows_mbottom,brand_p0_rows_middle,brand_p0_rows_top,brand_p0_rows_vw
del model_col_notna_top_bottom,model_col_notna_top_mbottom,model_col_notna_top_middle,model_col_notna_top,model_col_notna_top_vw
del brand_p0_notna_top_bottom,brand_p0_notna_top_mbottom,brand_p0_notna_top_middle,brand_p0_notna_top,brand_p0_notna_top_vw
gc.collect()
21322
This is something that should have been fixed in the previous code, not just adding new code.
Is there a reason we are replacing the registration year for all Jetta models made after 2005 with 2016?
pivot = pd.pivot_table(df, index = 'model', columns = 'brand', values = 'price')
pivot.boxplot(vert = False, figsize = (12,8))
plt.show()
chevy = df[df['brand'] == 'chevrolet']
chevy_pivot = pd.pivot_table(chevy, index = 'registrationyear', columns = 'model', values = 'price')
chevy_pivot
chevy_pivot.boxplot(vert = False)
plt.show()
display(df[(df['vehicletype'] == 'suv') & (df['registrationyear'] > 2005) & (df['brand'] == 'chevrolet')].value_counts(subset = 'model'))
captiva = (df['vehicletype'] == 'suv') & (df['registrationyear'] > 2005) & (df['brand'] == 'chevrolet') & (df['model'] != 'other')
df.loc[captiva,['model']] = 'captiva'
df[(df['vehicletype'] == 'suv') & (df['registrationyear'] > 2005) & (df['brand'] == 'chevrolet')]
convertible = (df['brand'] == 'chevrolet') & (df['vehicletype'] == 'convertible')
df.loc[convertible,['model']] = 'other'
matiz68 = (df['brand'] == 'chevrolet') & (df['power'] == 68) & (df['price'] < 2600)
df.loc[matiz68,['model']] = 'matiz'
df.loc[matiz68,['vehicletype']] = 'small'
display(df[(df['brand'] == 'chevrolet') & (df['power'] == 68) & (df['price'] < 2600)])
matiz52 = (df['brand'] == 'chevrolet') & (df['power'] == 52)
df.loc[matiz52,['model']] = 'matiz'
df.loc[matiz52,['vehicletype']] = 'small'
display(df[(df['brand'] == 'chevrolet') & (df['power'] == 52)])
matiz67 = (df['brand'] == 'chevrolet') & (df['power'] == 67)
df.loc[matiz67,['model']] = 'matiz'
df.loc[matiz67,['vehicletype']] = 'small'
display(df[(df['brand'] == 'chevrolet') & (df['power'] == 67)])
peugeot = df[df['brand'] == 'peugeot']
peugeot_pivot = pd.pivot_table(peugeot,index = 'power', columns = 'model', values = 'price')
df[(df['brand'] == 'peugeot') & (df['power'].isin([7,33,42,43,48,57, 454]))]
re_1 = (df['brand'] == 'peugeot') & (df['power'].isin([7,33,42,43,48,57,454]))
df.loc[re_1,['vehicletype']] = 'small'
df.loc[re_1,['model']] = '1_reihe'
display(df[(df['brand'] == 'peugeot') & (df['power'].isin([7,33,42,43,48,57,454]))])
model
captiva    164
other       19
dtype: int64
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82008 | 08/03/2016 22:44 | 2599 | small | 2008.0 | manual | 68 | matiz | 100000 | 8 | NaN | chevrolet | NaN | 2016-08-03 | 0 | 44145 | 14/03/2016 06:16 | N |
| 140254 | 22/03/2016 21:36 | 1200 | small | 2005.0 | manual | 68 | matiz | 90000 | 5 | petrol | chevrolet | NaN | 2016-03-22 | 0 | 4155 | 24/03/2016 07:15 | N |
| 205903 | 14/03/2016 19:41 | 1799 | small | 2008.0 | manual | 68 | matiz | 100000 | 5 | petrol | chevrolet | no | 2016-03-14 | 0 | 24816 | 06/04/2016 04:17 | N |
| 257625 | 23/03/2016 10:38 | 1500 | small | 2005.0 | manual | 68 | matiz | 150000 | 11 | lpg | chevrolet | NaN | 2016-03-23 | 0 | 41238 | 24/03/2016 17:17 | NaN |
| 353189 | 19/03/2016 13:37 | 1200 | small | 2016.0 | manual | 68 | matiz | 90000 | 5 | petrol | chevrolet | NaN | 2016-03-19 | 0 | 4155 | 21/03/2016 17:50 | N |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 373 | 02/04/2016 12:39 | 1350 | small | 2005.0 | manual | 52 | matiz | 150000 | 6 | petrol | chevrolet | yes | 2016-02-04 | 0 | 91207 | 06/04/2016 10:17 | N |
| 2263 | 27/03/2016 19:55 | 2399 | small | 2016.0 | manual | 52 | matiz | 80000 | 7 | petrol | chevrolet | NaN | 2016-03-27 | 0 | 33605 | 05/04/2016 18:45 | N |
| 2820 | 26/03/2016 20:47 | 3350 | small | 2010.0 | manual | 52 | matiz | 80000 | 2 | petrol | chevrolet | no | 2016-03-26 | 0 | 18273 | 06/04/2016 11:17 | N |
| 5636 | 30/03/2016 08:55 | 3650 | small | 2009.0 | manual | 52 | matiz | 50000 | 7 | petrol | chevrolet | no | 2016-03-30 | 0 | 26789 | 30/03/2016 08:55 | N |
| 7123 | 04/04/2016 18:39 | 2500 | small | 2008.0 | manual | 52 | matiz | 125000 | 12 | petrol | chevrolet | no | 2016-04-04 | 0 | 21493 | 06/04/2016 20:44 | N |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340075 | 17/03/2016 21:37 | 4999 | small | 2010.0 | auto | 52 | matiz | 30000 | 3 | petrol | chevrolet | no | 2016-03-17 | 0 | 45329 | 17/03/2016 22:40 | N |
| 340549 | 29/03/2016 15:57 | 1599 | small | 2009.0 | manual | 52 | matiz | 80000 | 5 | petrol | chevrolet | no | 2016-03-29 | 0 | 20357 | 06/04/2016 02:15 | N |
| 344585 | 13/03/2016 17:50 | 2100 | small | 2009.0 | manual | 52 | matiz | 125000 | 11 | petrol | chevrolet | no | 2016-03-13 | 0 | 22869 | 28/03/2016 14:16 | N |
| 349474 | 08/03/2016 13:25 | 2600 | small | 2009.0 | manual | 52 | matiz | 50000 | 3 | petrol | chevrolet | no | 2016-08-03 | 0 | 65719 | 11/03/2016 09:45 | N |
| 349800 | 01/04/2016 22:38 | 1950 | small | 2008.0 | manual | 52 | matiz | 60000 | 9 | petrol | chevrolet | no | 2016-01-04 | 0 | 42369 | 01/04/2016 23:41 | N |
101 rows × 17 columns
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1981 | 27/03/2016 18:43 | 2990 | small | 2007.0 | manual | 67 | matiz | 125000 | 4 | lpg | chevrolet | no | 2016-03-27 | 0 | 72108 | 05/04/2016 15:15 | N |
| 3769 | 01/04/2016 15:53 | 1500 | small | 2016.0 | manual | 67 | matiz | 125000 | 10 | NaN | chevrolet | NaN | 2016-01-04 | 0 | 4158 | 07/04/2016 13:50 | N |
| 5215 | 26/03/2016 08:55 | 2900 | small | 2010.0 | manual | 67 | matiz | 80000 | 4 | petrol | chevrolet | no | 2016-03-26 | 0 | 25421 | 03/04/2016 19:47 | N |
| 7757 | 21/03/2016 09:52 | 3750 | small | 2007.0 | manual | 67 | matiz | 70000 | 10 | lpg | chevrolet | no | 2016-03-21 | 0 | 53945 | 06/04/2016 02:45 | N |
| 9006 | 14/03/2016 11:38 | 2750 | small | 2007.0 | manual | 67 | matiz | 70000 | 10 | petrol | chevrolet | no | 2016-03-14 | 0 | 21029 | 07/04/2016 12:45 | N |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340326 | 02/04/2016 22:51 | 2150 | small | 2007.0 | manual | 67 | matiz | 150000 | 12 | petrol | chevrolet | no | 2016-02-04 | 0 | 31863 | 07/04/2016 00:45 | N |
| 344984 | 26/03/2016 22:54 | 2100 | small | 2007.0 | manual | 67 | matiz | 125000 | 6 | petrol | chevrolet | no | 2016-03-26 | 0 | 48565 | 04/04/2016 22:47 | N |
| 348552 | 04/04/2016 13:46 | 2250 | small | 2006.0 | manual | 67 | matiz | 150000 | 7 | lpg | chevrolet | no | 2016-04-04 | 0 | 33397 | 06/04/2016 14:46 | N |
| 351693 | 28/03/2016 17:41 | 1100 | small | 2006.0 | manual | 67 | matiz | 150000 | 6 | petrol | chevrolet | no | 2016-03-28 | 0 | 46537 | 06/04/2016 23:15 | N |
| 352283 | 12/03/2016 15:46 | 1950 | small | 2007.0 | manual | 67 | matiz | 90000 | 8 | petrol | chevrolet | no | 2016-12-03 | 0 | 48529 | 15/03/2016 21:16 | N |
91 rows × 17 columns
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44179 | 02/04/2016 17:52 | 500 | small | 1998.0 | auto | 7 | 1_reihe | 100000 | 11 | petrol | peugeot | no | 2016-02-04 | 0 | 66271 | 02/04/2016 17:52 | N |
| 154470 | 07/03/2016 10:52 | 100 | small | 1995.0 | manual | 42 | 1_reihe | 150000 | 6 | petrol | peugeot | NaN | 2016-07-03 | 0 | 1665 | 15/03/2016 22:16 | N |
| 174795 | 10/03/2016 23:44 | 150 | small | 1997.0 | manual | 33 | 1_reihe | 150000 | 11 | petrol | peugeot | yes | 2016-10-03 | 0 | 66333 | 11/03/2016 12:17 | N |
| 186556 | 20/03/2016 16:55 | 430 | small | 2016.0 | NaN | 33 | 1_reihe | 150000 | 9 | petrol | peugeot | NaN | 2016-03-20 | 0 | 73525 | 04/04/2016 20:44 | N |
| 191097 | 23/03/2016 22:51 | 0 | small | 1997.0 | manual | 33 | 1_reihe | 125000 | 6 | NaN | peugeot | yes | 2016-03-23 | 0 | 86343 | 06/04/2016 06:45 | NaN |
| 204925 | 29/03/2016 15:45 | 850 | small | 1997.0 | manual | 57 | 1_reihe | 150000 | 2 | petrol | peugeot | no | 2016-03-29 | 0 | 16909 | 06/04/2016 01:16 | N |
| 210942 | 30/03/2016 15:51 | 700 | small | 1998.0 | manual | 454 | 1_reihe | 150000 | 8 | petrol | peugeot | NaN | 2016-03-30 | 0 | 85598 | 30/03/2016 15:51 | N |
| 262687 | 05/03/2016 16:52 | 0 | small | 1996.0 | manual | 48 | 1_reihe | 150000 | 7 | petrol | peugeot | yes | 2016-05-03 | 0 | 26441 | 24/03/2016 18:45 | N |
| 314981 | 20/03/2016 04:02 | 700 | small | 2016.0 | manual | 33 | 1_reihe | 150000 | 7 | petrol | peugeot | no | 2016-03-20 | 0 | 28759 | 23/03/2016 22:17 | N |
| 323988 | 10/03/2016 22:50 | 1033 | small | 1996.0 | manual | 43 | 1_reihe | 150000 | 10 | petrol | peugeot | no | 2016-10-03 | 0 | 42277 | 24/03/2016 20:18 | N |
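The Chevrolet and Peugeot relabeling steps above all follow one pattern: build a mask from brand and power (sometimes with a price ceiling), then overwrite `model` and `vehicletype`. One way to keep such rules auditable is to store them as data and apply them in a loop. The sketch below assumes a hypothetical rule layout (`brand`, `power`, optional `price_below`, `set`) and a hypothetical function name; the rule values are copied from the cells above.

```python
import pandas as pd

# Rule values taken from the cells above; the dictionary layout itself is an assumption.
RELABEL_RULES = [
    {"brand": "chevrolet", "power": [68], "price_below": 2600,
     "set": {"model": "matiz", "vehicletype": "small"}},
    {"brand": "chevrolet", "power": [52, 67],
     "set": {"model": "matiz", "vehicletype": "small"}},
    {"brand": "peugeot", "power": [7, 33, 42, 43, 48, 57, 454],
     "set": {"model": "1_reihe", "vehicletype": "small"}},
]


def apply_relabel_rules(frame, rules):
    """Return a copy of `frame` with each rule's columns overwritten where it matches."""
    out = frame.copy()
    for rule in rules:
        mask = (out["brand"] == rule["brand"]) & out["power"].isin(rule["power"])
        if "price_below" in rule:
            mask &= out["price"] < rule["price_below"]
        for column, value in rule["set"].items():
            out.loc[mask, column] = value
    return out
```

Because the function returns a copy, the result can be compared against the original frame to review exactly which rows a rule touched before committing to it.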
wagon = df[(df['vehicletype'] == 'wagon') & (df['price'] > 0)]
wagon['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Wagons per Brand')
plt.show()
wagon.groupby('brand')['price'].mean().sort_values(ascending=False).plot(kind='bar', figsize=(10,5), title='Average Wagon Price per Brand')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=wagon, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Wagon Models per Brand')
plt.grid()
plt.show()
Wagon Type Vehicles Against Price
| Brand | Vehicle Type | (~)Count | Avg Price | Distribution (25 - 75) |
|---|---|---|---|---|
| volkswagen | Wagon | 12,500 | 5,000 | 1,250 - 7,000 |
| audi | Wagon | 11,000 | 7,000 | 2,500 - 11,000 |
| bmw | Wagon | 8,000 | 7,000 | 2,300 - 9,500 |
| opel | Wagon | 7,000 | 3,500 | 1,000 - 4,500 |
| mercedes_benz | Wagon | 6,500 | 6,000 | 1,500 - 8,500 |
| ford | Wagon | 5,900 | 6,000 | 1,500 - 8,000 |
| skoda | Wagon | 3,000 | 6,500 | 2,000 - 9,000 |
| volvo | Wagon | 2,200 | 5,500 | 2,000 - 7,500 |
| renault | Wagon | 2,000 | 3,000 | 1,000 - 4,000 |
| peugeot | Wagon | 1,800 | 4,900 | 1,500 - 6,500 |
| mazda | Wagon | 1,000 | 4,800 | 2,000 - 6,500 |
| toyota | Wagon | 800 | 4,700 | 2,000 - 6,500 |
| alfa_romeo | Wagon | 600 | 4,400 | 1,500 - 6,000 |
| fiat | Wagon | 500 | 2,200 | 1,000 - 3,000 |
| seat | Wagon | 500 | 4,000 | 1,500 - 5,500 |
| nissan | Wagon | 400 | 3,800 | 1,500 - 5,000 |
| citroen | Wagon | 400 | 3,700 | 1,500 - 5,000 |
| mitsubishi | Wagon | 300 | 1,800 | 800 - 2,500 |
| dacia | Wagon | 300 | 3,700 | 2,000 - 5,000 |
| chevrolet | Wagon | 200 | 3,500 | 1,500 - 5,000 |
| hyundai | Wagon | 200 | 11,500 | 6,000 - 15,000 |
| kia | Wagon | 200 | 3,300 | 1,500 - 4,500 |
| mini | Wagon | 100 | 8,000 | 4,000 - 11,000 |
| subaru | Wagon | <100 | 4,000 | 2,000 - 5,500 |
| honda | Wagon | <100 | 3,000 | 1,500 - 4,000 |
| chrysler | Wagon | <100 | 2,800 | 1,000 - 4,000 |
| saab | Wagon | <100 | 2,800 | 1,000 - 4,000 |
| suzuki | Wagon | <100 | 2,300 | 1,000 - 3,000 |
| smart | Wagon | <100 | 2,200 | 1,000 - 3,000 |
| lancia | Wagon | <100 | 2,000 | 800 - 3,000 |
| daewoo | Wagon | <100 | 900 | 500 - 1,200 |
| jaguar | Wagon | <100 | 1,800 | 1,000 - 2,500 |
| land_rover | Wagon | <100 | 2,900 | 1,500 - 4,000 |
| lada | Wagon | <100 | 1,700 | 800 - 2,500 |
| rover | Wagon | <100 | 1,600 | 800 - 2,200 |
| trabant | Wagon | <100 | 1,800 | 1,000 - 2,500 |
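The figures in the table above are approximate values read off the plots. A minimal sketch of how the same summary could be computed directly, assuming the lowercased column names `brand`, `vehicletype` and `price` used in this notebook. The helper names `q25`, `q75` and `wagon_price_summary`, and the toy rows, are hypothetical and only illustrate the idea:

```python
import pandas as pd


def q25(prices: pd.Series) -> float:
    """25th percentile of a price series."""
    return prices.quantile(0.25)


def q75(prices: pd.Series) -> float:
    """75th percentile of a price series."""
    return prices.quantile(0.75)


def wagon_price_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-brand count, mean price and quartiles for wagons with price > 0."""
    wagons = df[(df["vehicletype"] == "wagon") & (df["price"] > 0)]
    summary = wagons.groupby("brand")["price"].agg(["count", "mean", q25, q75])
    # Most common brands first, as in the table
    return summary.sort_values("count", ascending=False)


# Toy rows (made up) standing in for the real dataset
toy_wagons = pd.DataFrame({
    "brand": ["audi", "audi", "audi", "opel", "opel", "opel"],
    "vehicletype": ["wagon", "wagon", "wagon", "wagon", "sedan", "wagon"],
    "price": [1000, 2000, 3000, 0, 900, 500],
})
wagon_summary = wagon_price_summary(toy_wagons)
print(wagon_summary)
```

On the real data this would be called as `wagon_price_summary(df)`, which gives exact counts and quartiles instead of estimates taken from the charts.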
df[(df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & \
(df['price'] < 7000) & (df['registrationyear'] > 1996) & (df['registrationyear'] < 1999) & (df['power'].isin([150]))]
passat = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'] > 1996) & \
(df['registrationyear'] < 1999) & (df['power'].isin([150]))
df.loc[passat,['model']] = 'passat'
passat1 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'] == 1991) & \
(df['power'].isin([90,136]))
df.loc[passat1,['model']] = 'passat'
passat2 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'] == 1992) & \
(df['model'].isna())
df.loc[passat2,['model']] = 'passat'
passat3 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & \
(df['registrationyear'].isin([1982,1993,1994])) & (df['model'].isna())
df.loc[passat3,['model']] = 'passat'
passat4 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'].isin([1996])) & \
(df['power'].isin([174])) & (df['model'].isna())
df.loc[passat4,['model']] = 'passat'
passat5 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['model'].isna()) & (df['power'].isin([125])) & (df['price'] > 1250) & \
(df['price'] < 7000)
df.loc[passat5,['model']] = 'passat'
passat6 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['power'].isin([110,193])) & (df['price'] > 1250) & (df['price'] < 7000) & \
(df['registrationyear'].isin([1998])) & (df['model'].isna())
df.loc[passat6, ['model']] = 'passat'
golf = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'].isin([1996])) & \
(df['power'].isin([75,110])) & (df['model'].isna())
df.loc[golf,['model']] = 'golf'
passat140 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['model'].isna()) & (df['registrationyear'] > 2004) & \
(df['registrationyear'] < 2007) & (df['power'].isin([140]))
df.loc[passat140,['model']] = 'passat'
golf90 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 2000) & \
(df['registrationyear'] < 2005) & (df['power'].isin([90]))
df.loc[golf90,['model']] = 'golf'
passat90 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1985) & \
(df['registrationyear'] < 1993) & (df['power'].isin([90]))
df.loc[passat90,['model']] = 'passat'
golf75 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1993) & \
(df['registrationyear'] < 1995) & (df['power'].isin([75]))
df.loc[golf75,['model']] = 'golf'
golf7502 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 2001) & \
(df['registrationyear'] < 2003) & (df['power'].isin([75]))
df.loc[golf7502,['model']] = 'golf'
passat105 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1995) & \
(df['registrationyear'] < 1998) & (df['power'].isin([105]))
df.loc[passat105,['model']] = 'passat'
passat131 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1999) & \
(df['registrationyear'] < 2002) & (df['power'].isin([131]))
df.loc[passat131,['model']] = 'passat'
passat116 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1989) \
& (df['registrationyear'] < 1997) & (df['power'].isin([116]))
df.loc[passat116,['model']] = 'passat'
passat150 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1995) \
& (df['registrationyear'] < 2006) & (df['power'].isin([150]))
df.loc[passat150,['model']] = 'passat'
passat115 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1990) \
& (df['registrationyear'] < 1997) & (df['power'].isin([115]))
df.loc[passat115,['model']] = 'passat'
passat170 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 2004) & \
(df['registrationyear'] < 2012) & (df['power'].isin([170]))
df.loc[passat170,['model']] = 'passat'
# Note: the mask is named golf110, but the filter below selects 60 HP (registration years 2014-2016)
golf110 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 2013) & \
(df['registrationyear'] < 2017) & (df['power'].isin([60]))
df.loc[golf110,['model']] = 'golf'
golf60 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1990) & \
(df['registrationyear'] < 1996) & (df['power'].isin([60]))
df.loc[golf60,['model']] = 'golf'
polo60 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1996) & \
(df['registrationyear'] < 2001) & (df['power'].isin([60]))
df.loc[polo60,['model']] = 'polo'
passat125 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1996) & \
(df['registrationyear'] < 2000) & (df['power'].isin([125]))
df.loc[passat125,['model']] = 'passat'
passat100 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1990) & \
(df['registrationyear'] < 2005) & (df['power'].isin([100]))
df.loc[passat100,['model']] = 'passat'
passat174 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1993) & \
(df['registrationyear'] < 1997) & (df['power'].isin([174]))
df.loc[passat174,['model']] = 'passat'
passat130 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1998) & \
(df['registrationyear'] < 2005) & (df['power'].isin([130]))
df.loc[passat130,['model']] = 'passat'
passat120 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1980) & (df['registrationyear'] < 2000) & (df['power'].isin([120]))
df.loc[passat120,['model']] = 'passat'
vw_small75 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1985,1992]))
df.loc[vw_small75,['model']] = 'golf'
vw_sedan75 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'] > 1993) & (df['registrationyear'] < 2007)
df.loc[vw_sedan75,['model']] = 'golf'
opel_sedan84 = (df['brand'] == 'opel') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1984]))
df.loc[opel_sedan84,['model']] = 'kadett'
opel_sedan94 = (df['brand'] == 'opel') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1994,1999,2000]))
df.loc[opel_sedan94,['model']] = 'astra'
opel_sedan04 = (df['brand'] == 'opel') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([2004,2008]))
df.loc[opel_sedan04,['model']] = 'corsa'
ford_sedan99 = (df['brand'] == 'ford') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1999,2001,2003]))
df.loc[ford_sedan99,['model']] = 'focus'
opel_wagon96 = (df['brand'] == 'opel') & (df['vehicletype'] == 'wagon') & (df['model'].isna()) & (df['power'].isin([75])) & (df['registrationyear'] > 1995) \
& (df['registrationyear'] < 2001)
df.loc[opel_wagon96,['model']] = 'astra'
opel_small01 = (df['brand'] == 'opel') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) \
& (df['registrationyear'].isin([2001, 2002, 2003, 2004, 2006, 2008]))
df.loc[opel_small01,['model']] = 'corsa'
renault_small91 = (df['brand'] == 'renault') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'] > 1990) & (df['registrationyear'] < 2001)
df.loc[renault_small91,['model']] = 'clio'
peugeot_small92 = (df['brand'] == 'peugeot') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1992]))
df.loc[peugeot_small92,['model']] = '1_reihe'
peugeot_small94 = (df['brand'] == 'peugeot') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1994]))
df.loc[peugeot_small94,['model']] = '3_reihe'
peugeot_small00 = (df['brand'] == 'peugeot') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'] > 1999) & (df['registrationyear'] < 2010)
df.loc[peugeot_small00,['model']] = '2_reihe'
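The hand-written masks above all follow one pattern: within a brand, vehicle type and horsepower group, a missing model gets the model that dominates that group. A minimal sketch of a data-driven version of that idea, offered as an alternative and not as the method used in this notebook. The function name `fill_model_by_group_mode`, the `min_count` threshold and the toy rows are assumptions made for illustration:

```python
import pandas as pd


def fill_model_by_group_mode(df, keys=("brand", "vehicletype", "power"), min_count=2):
    """Return a copy of df where NaN 'model' values are filled with the most
    frequent known model of the same (brand, vehicletype, power) group.
    Groups whose top model occurs fewer than min_count times are left alone."""
    keys = list(keys)
    out = df.copy()
    known = out.dropna(subset=["model"])

    # Count how often each model occurs inside each group
    counts = known.groupby(keys + ["model"]).size().rename("n").reset_index()
    # Keep only the most frequent model per group (stable sort for reproducible ties)
    counts = counts.sort_values("n", ascending=False, kind="mergesort")
    top = counts.drop_duplicates(subset=keys)
    top = top[top["n"] >= min_count]

    # Look up the top model for every row with a missing model
    missing = out["model"].isna()
    merged = out.loc[missing, keys].merge(top[keys + ["model"]], on=keys, how="left")
    out.loc[missing, "model"] = merged["model"].to_numpy()
    return out


# Toy rows (made up) standing in for the real dataset
toy_models = pd.DataFrame({
    "brand": ["volkswagen"] * 4 + ["opel"] * 2,
    "vehicletype": ["wagon"] * 4 + ["sedan"] * 2,
    "power": [150, 150, 150, 150, 75, 75],
    "model": ["passat", "passat", "golf", None, "astra", None],
})
filled_models = fill_model_by_group_mode(toy_models)
print(filled_models)
```

Unlike the manual masks, this sketch ignores registration year and price. Adding a year bucket to `keys` would bring it closer to the rules above, at the cost of smaller groups.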
del vw0, bmw0, merc0, opel0, audi0, fiat0, ford0, mazda0, peu0, ren0, smart0, chevy, chevy_pivot, captiva, convertible, matiz68, matiz52, matiz67
del peugeot, peugeot_pivot, re_1, wagon, passat, passat1, passat2, passat3, passat4, passat5, passat6, golf
del passat140, golf90, passat90, golf75, golf7502, passat105, passat131, passat116, passat150, passat115, passat170, golf110, golf60, polo60
del passat125, passat100, passat174, passat130, passat120
del vw_small75,vw_sedan75,opel_sedan84,opel_sedan94,opel_sedan04,ford_sedan99,opel_wagon96,opel_small01,renault_small91,peugeot_small92
del peugeot_small94,peugeot_small00
gc.collect()
25995
brand_power = df[(df['power'].isin([75,60,150,101,140,90,116,105,170,125,136,102])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')].value_counts(subset = 'brand')
brand_power.plot(kind = 'bar')
plt.title("Brands with Top HP counts")
plt.grid()
plt.show()
brand_power1 = df[(df['power'].isin([75,60])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')]
brand_power2 = df[(df['power'].isin([150,101])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')]
brand_power3 = df[(df['power'].isin([140,90])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')]
brand_power4 = df[(df['power'].isin([116,105])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')]
brand_power5 = df[(df['power'].isin([170,125])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')]
brand_power6 = df[(df['power'].isin([136,102])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')]
top5_brand_power = ['volkswagen','opel','bmw','audi','ford']
over1000_brand_power = ['mercedes_benz', 'renault', 'peugeot', 'seat', 'skoda', 'fiat', 'citroen', 'honda', 'mazda', 'mini', 'nissan', 'mitsubishi', 'volvo']
under1000_brand_power = ['toyota', 'alfa_romeo', 'hyundai', 'kia', 'dacia', 'suzuki', 'chrysler', 'subaru', 'smart', 'chevrolet', 'saab', 'lancia',
'rover', 'jeep', 'daihatsu', 'daewoo', 'porsche', 'lada', 'land_rover', 'jaguar']
top5_brands = brand_power1[brand_power1['brand'].isin(top5_brand_power)]
top5_brands2 = brand_power2[brand_power2['brand'].isin(top5_brand_power)]
top5_brands3 = brand_power3[brand_power3['brand'].isin(top5_brand_power)]
top5_brands4 = brand_power4[brand_power4['brand'].isin(top5_brand_power)]
top5_brands5 = brand_power5[brand_power5['brand'].isin(top5_brand_power)]
top5_brands6 = brand_power6[brand_power6['brand'].isin(top5_brand_power)]
middle_brands = brand_power1[brand_power1['brand'].isin(over1000_brand_power)]
middle_brands2 = brand_power2[brand_power2['brand'].isin(over1000_brand_power)]
middle_brands3 = brand_power3[brand_power3['brand'].isin(over1000_brand_power)]
middle_brands4 = brand_power4[brand_power4['brand'].isin(over1000_brand_power)]
middle_brands5 = brand_power5[brand_power5['brand'].isin(over1000_brand_power)]
middle_brands6 = brand_power6[brand_power6['brand'].isin(over1000_brand_power)]
lower_brands = brand_power1[brand_power1['brand'].isin(under1000_brand_power)]
lower_brands2 = brand_power2[brand_power2['brand'].isin(under1000_brand_power)]
lower_brands3 = brand_power3[brand_power3['brand'].isin(under1000_brand_power)]
lower_brands4 = brand_power4[brand_power4['brand'].isin(under1000_brand_power)]
lower_brands5 = brand_power5[brand_power5['brand'].isin(under1000_brand_power)]
lower_brands6 = brand_power6[brand_power6['brand'].isin(under1000_brand_power)]
# Count known brand/model/power combinations to help infer missing (NaN) models
top5 = top5_brands[['brand','model','power']].value_counts().sort_index()
top52 = top5_brands2[['brand','model','power']].value_counts().sort_index()
top53 = top5_brands3[['brand','model','power']].value_counts().sort_index()
top54 = top5_brands4[['brand','model','power']].value_counts().sort_index()
top55 = top5_brands5[['brand','model','power']].value_counts().sort_index()
top56 = top5_brands6[['brand','model','power']].value_counts().sort_index()
middle = middle_brands[['brand','model','power']].value_counts().sort_index()
middle2 = middle_brands2[['brand','model','power']].value_counts().sort_index()
middle3 = middle_brands3[['brand','model','power']].value_counts().sort_index()
middle4 = middle_brands4[['brand','model','power']].value_counts().sort_index()
middle5 = middle_brands5[['brand','model','power']].value_counts().sort_index()
middle6 = middle_brands6[['brand','model','power']].value_counts().sort_index()
lower = lower_brands[['brand','model','power']].value_counts().sort_index()
lower2 = lower_brands2[['brand','model','power']].value_counts().sort_index()
lower3 = lower_brands3[['brand','model','power']].value_counts().sort_index()
lower4 = lower_brands4[['brand','model','power']].value_counts().sort_index()
lower5 = lower_brands5[['brand','model','power']].value_counts().sort_index()
lower6 = lower_brands6[['brand','model','power']].value_counts().sort_index()
print("Batch 1: HP [75 & 60]")
# Top 5 Prevalent Brands w/ specified HP [75 & 60]
top5.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [75 & 60] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
# Middle Prevalent Brands w/ specified HP [75 & 60]
middle.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [75 & 60] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
# Lower Prevalent Brands w/ specified HP [75 & 60]
lower.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [75 & 60] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
print("Batch 2: HP [150 & 101]")
# Batch 2: HP [150 & 101]
top52.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [150 & 101] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
middle2.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [150 & 101] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
lower2.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [150 & 101] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
print("Batch 3: HP [140 & 90]")
# Batch 3: HP [140 & 90]
top53.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [140 & 90] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
middle3.plot(kind = 'bar', x = ('brand','model','power'), figsize = (20,8))
plt.title('Middle: Count of [140 & 90] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
lower3.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [140 & 90] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
print("Batch 4: HP [116 & 105]")
# Batch 4: HP [116 & 105]
top54.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [116 & 105] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
middle4.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [116 & 105] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
lower4.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [116 & 105] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
print("Batch 5: HP [170 & 125]")
# Batch 5: HP [170 & 125]
top55.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [170 & 125] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
middle5.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [170 & 125] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
lower5.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [170 & 125] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
print("Batch 6: HP [136 & 102]")
# Batch 6: HP [136 & 102]
top56.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [136 & 102] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
middle6.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [136 & 102] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
lower6.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [136 & 102] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
Batch 1: HP [75 & 60]
Batch 2: HP [150 & 101]
Batch 3: HP [140 & 90]
Batch 4: HP [116 & 105]
Batch 5: HP [170 & 125]
Batch 6: HP [136 & 102]
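The eighteen count tables behind these batches differ only in the horsepower pair and the brand tier. A minimal sketch of how they could be produced by one helper in a loop, assuming the column names used in this notebook. The function name `model_power_counts`, the shortened `batches` and `tiers` dictionaries, and the toy rows are hypothetical:

```python
import pandas as pd


def model_power_counts(df, hp_values, brands):
    """Count known (brand, model, power) combinations for the given HP values and brands."""
    known = df[
        df["power"].isin(hp_values)
        & df["model"].notna()
        & (df["brand"] != "sonstige_autos")
        & df["brand"].isin(brands)
    ]
    return known[["brand", "model", "power"]].value_counts().sort_index()


# Toy rows (made up) standing in for the real dataset
toy_power = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "volkswagen", "opel", "volkswagen", "fiat", "volkswagen"],
    "model": ["golf", "golf", "polo", "corsa", None, "punto", "golf"],
    "power": [75, 75, 60, 60, 75, 60, 150],
})

# Shortened versions of the batch and tier definitions, for illustration only
batches = {"Batch 1": [75, 60], "Batch 2": [150, 101]}
tiers = {"Top 5": ["volkswagen", "opel"], "Middle": ["fiat"]}

batch_counts = {
    (batch, tier): model_power_counts(toy_power, hp_values, brands)
    for batch, hp_values in batches.items()
    for tier, brands in tiers.items()
}
print(batch_counts[("Batch 1", "Top 5")])
```

Each resulting Series can be passed to `.plot(kind='bar', figsize=(18,8))` exactly as in the cells above, with the plot title built from the dictionary keys so that label and filter cannot drift apart.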
display(df[(df['brand'].isin(['audi'])) & (df['power'].isin([75])) & (df['vehicletype'] == 'small') & (df['model'].notna())].value_counts(subset = 'model'))
display(df[(df['brand'].isin(['opel'])) & (df['vehicletype'] != 'small') & (df['power'].isin([60])) & (df['registrationyear'] > 1991) & (df['registrationyear'] < 1993)].value_counts(subset = 'model'))
display(df[(df['brand'].isin(['mini'])) & (df['power'].isin([75])) & (df['model'].notna())].value_counts(subset = 'model'))
display(df[(df['brand'].isin(['seat'])) & (df['registrationyear'] > 2002) & (df['registrationyear'] < 2012) & (df['power'].isin([75]))& (df['model'].notna())].value_counts(subset = 'model'))
display(df[(df['brand'].isin(['skoda'])) & (df['registrationyear'] > 2000) & (df['registrationyear'] != 2013) & (df['power'].isin([75]))& (df['model'].notna())].value_counts(subset = 'model'))
display(df[(df['brand'].isin(['smart'])) & (df['power'].isin([60]))& (df['model'].notna())].value_counts(subset = 'model'))
display(df[(df['brand'].isin(['bmw'])) & (df['power'].isin([101])) & (df['model'].notna())].value_counts(subset = 'model'))
display(df[(df['brand'].isin(['mitsubishi'])) & (df['vehicletype'] == 'sedan') & (df['registrationyear'] == 1999) & (df['power'].isin([150])) & (df['model'].notna())].value_counts(subset = 'model'))
display(df[(df['brand'].isin(['mitsubishi'])) & (df['vehicletype'] == 'bus') & (df['power'].isin([150])) & (df['model'].notna())].value_counts(subset = 'model'))
display(df[(df['brand'].isin(['honda'])) & (df['power'].isin([101])) & (df['model'].notna())].value_counts(subset = 'model'))
model
a2    210
a3      2
dtype: int64
model
astra    4
dtype: int64
model
one       118
cooper      3
other       1
dtype: int64
model
ibiza      322
cordoba     13
altea        1
dtype: int64
model
fabia       479
citigo       16
octavia       9
other         3
roomster      2
dtype: int64
model
fortwo      47
forfour      2
other        2
roadster     1
dtype: int64
model
3er        212
1er          5
other        4
m_reihe      1
dtype: int64
model
galant    10
dtype: int64
model
other    32
dtype: int64
model
civic    31
jazz      2
dtype: int64
audi75 = (df['brand'].isin(['audi'])) & (df['power'].isin([75])) & (df['vehicletype'] == 'small') & (df['model'].isna())
df.loc[audi75,['model']] = 'a2'
opelastra = (df['brand'].isin(['opel'])) & (df['vehicletype'] != 'small') & (df['power'].isin([60])) & (df['registrationyear'] > 1991) & (df['registrationyear'] < 1993)& (df['model'].isna())
df.loc[opelastra,['model']] = 'astra'
mini75 = (df['brand'].isin(['mini'])) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[mini75,['model']] = 'one'
nissan60 = (df['brand'].isin(['nissan'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[nissan60,['model']] = 'micra'
ibiza03 = (df['brand'].isin(['seat'])) & (df['registrationyear'] > 2002) & (df['registrationyear'] < 2012) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[ibiza03,['model']] = 'ibiza'
skoda60 = (df['brand'].isin(['skoda'])) & (df['registrationyear'] > 2000) & (df['registrationyear'] != 2013) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[skoda60,['model']] = 'fabia'
lancia60 = (df['brand'].isin(['lancia'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[lancia60,['model']] = 'ypsilon'
smart60 = (df['brand'].isin(['smart'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[smart60,['model']] = 'fortwo'
bmw101 = (df['brand'].isin(['bmw'])) & (df['power'].isin([101])) & (df['model'].isna())
df.loc[bmw101,['model']] = '3er'
mit99 = (df['brand'].isin(['mitsubishi'])) & (df['vehicletype'] == 'sedan') & (df['registrationyear'] == 1999) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[mit99,['model']] = 'galant'
mitbus00 = (df['brand'].isin(['mitsubishi'])) & (df['vehicletype'] == 'bus') & (df['power'].isin([150])) & (df['model'].isna())
df.loc[mitbus00,['model']] = 'other'
honda101 = (df['brand'].isin(['honda'])) & (df['power'].isin([101])) & (df['model'].isna())
df.loc[honda101,['model']] = 'civic'
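The cells above repeat a two-step pattern: display the model counts for a filter, then assign the dominant model to the rows where `model` is missing. A minimal sketch that combines both steps and only fills when one model clearly dominates. The helper name `fill_with_top_model`, the `min_share` threshold of 0.7 and the toy rows are assumptions for illustration, not part of the original analysis:

```python
import pandas as pd


def fill_with_top_model(df, cond, min_share=0.7):
    """Fill missing 'model' values (in place) for rows matching cond with the most
    common known model among those rows, provided its share is at least min_share.
    Returns the chosen model, or None if nothing was filled."""
    known = df.loc[cond & df["model"].notna(), "model"]
    if known.empty:
        return None
    shares = known.value_counts(normalize=True)  # sorted, most common first
    top_model, top_share = shares.index[0], shares.iloc[0]
    if top_share < min_share:
        return None
    df.loc[cond & df["model"].isna(), "model"] = top_model
    return top_model


# Toy rows (made up) standing in for the real dataset
toy_fill = pd.DataFrame({
    "brand": ["audi"] * 6 + ["honda"] * 3,
    "power": [75] * 6 + [101] * 3,
    "model": ["a2", "a2", "a2", "a2", "a3", None, "civic", "jazz", None],
})

audi_cond = (toy_fill["brand"] == "audi") & toy_fill["power"].isin([75])
honda_cond = (toy_fill["brand"] == "honda") & toy_fill["power"].isin([101])

chosen_audi = fill_with_top_model(toy_fill, audi_cond)    # a2 has a 4/5 share
chosen_honda = fill_with_top_model(toy_fill, honda_cond)  # civic and jazz tie at 1/2
print(chosen_audi, chosen_honda)
```

The threshold makes the judgment call explicit: in the outputs above, `a2` (210 of 212) would pass easily, while a near-even split would be left as NaN for manual review.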
topbrand_vt = ['volkswagen']
vt_power = df[(df['brand'].notna()) & (df['brand'].isin(topbrand_vt)) & (df['model'].notna()) & (df['vehicletype'].notna())]
vwvt = vt_power[['vehicletype','model']].value_counts().sort_index()
vwvt.plot(kind = 'bar', figsize = (16,8))
plt.title("Volkswagen: Model & Vehicle Type Abundance")
plt.grid()
plt.show()
# VW GOLF
golf = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([211,230,174,102, 122, 350, 250, 170, 86, 200,100,109,190,68,80,72,131,144,129,77,160,76,204])) & (df['model'].isna())
df.loc[golf,['model']] = 'golf'
golf02 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([90])) & (df['registrationyear'] > 2002) & (df['model'].isna())
df.loc[golf02,['model']] = 'golf'
golf98 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([60])) & (df['registrationyear'] < 1998) & (df['model'].isna())
df.loc[golf98,['model']] = 'golf'
golf09 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([125,110])) & (df['registrationyear'] == 2009) & (df['model'].isna())
df.loc[golf09,['model']] = 'golf'
golf99 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([150])) & (df['registrationyear'] > 1999) & (df['model'].isna())
df.loc[golf99,['model']] = 'golf'
golf04 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([140])) & (df['registrationyear'] == 2004) & (df['model'].isna())
df.loc[golf04,['model']] = 'golf'
golf91 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([55])) & (df['registrationyear'] != 1991) & (df['model'].isna())
df.loc[golf91,['model']] = 'golf'
# VW POLO
polo = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([64,54])) & (df['model'].isna())
df.loc[polo,['model']] = 'polo'
polo98 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([60])) & (df['registrationyear'] > 1998) & (df['model'].isna())
df.loc[polo98,['model']] = 'polo'
# VW PASSAT
passat = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([148,136])) & (df['model'].isna())
df.loc[passat,['model']] = 'passat'
passat97 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([125])) & (df['registrationyear'] == 1997) & (df['model'].isna())
df.loc[passat97,['model']] = 'passat'
# VW BEETLE
beetle = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([30])) & (df['model'].isna())
df.loc[beetle,['model']] = 'beetle'
# VW JETTA
jetta = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([70])) & (df['registrationyear'] == 1981) & (df['model'].isna())
df.loc[jetta,['model']] = 'jetta'
# VW PHAETON
phaeton = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([313,420,240])) & (df['model'].isna())
df.loc[phaeton,['model']] = 'phaeton'
phaeton05 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['registrationyear'] == 2005) & (df['power'].isin([224])) & (df['model'].isna())
df.loc[phaeton05,['model']] = 'phaeton'
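The Volkswagen sedan assignments above are effectively a small rule table keyed on horsepower and, in some cases, registration year. A minimal sketch of expressing such rules as data and applying them in a loop. The rule values are copied from a few of the masks above, while the names `VW_SEDAN_RULES` and `apply_model_rules`, the `years` convention (an inclusive range) and the toy rows are hypothetical:

```python
import pandas as pd

# Each rule mirrors one hand-written mask; "years" is an optional inclusive (min, max) range
VW_SEDAN_RULES = [
    {"power": [64, 54], "model": "polo"},
    {"power": [148, 136], "model": "passat"},
    {"power": [30], "model": "beetle"},
    {"power": [70], "years": (1981, 1981), "model": "jetta"},
    {"power": [313, 420, 240], "model": "phaeton"},
]


def apply_model_rules(df, rules, brand="volkswagen", vehicletype="sedan"):
    """Apply the rules in order (in place); only rows whose model is still missing are filled."""
    base = (df["brand"] == brand) & (df["vehicletype"] == vehicletype)
    for rule in rules:
        mask = base & df["power"].isin(rule["power"]) & df["model"].isna()
        if "years" in rule:
            first_year, last_year = rule["years"]
            mask &= df["registrationyear"].between(first_year, last_year)
        df.loc[mask, "model"] = rule["model"]
    return df


# Toy rows (made up) standing in for the real dataset
toy_rules = pd.DataFrame({
    "brand": ["volkswagen"] * 4 + ["opel"],
    "vehicletype": ["sedan"] * 5,
    "power": [64, 70, 70, 313, 64],
    "registrationyear": [1999, 1981, 1990, 2005, 1999],
    "model": [None, None, None, None, None],
})
apply_model_rules(toy_rules, VW_SEDAN_RULES)
print(toy_rules)
```

Because every rule requires `model` to be missing at the time it runs, earlier rules take precedence over later ones, which matches the behavior of the sequential `df.loc[...]` assignments.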
# Wagon
trabant = (df['vehicletype'] == 'wagon') & (df['brand'] == 'trabant') & (df['model'].isna())
df.loc[trabant,['model']] = '601'
bmw = (df['vehicletype'] == 'wagon') & (df['registrationyear'] < 1990) & (df['brand'] == 'bmw') & (df['model'].isna())
df.loc[bmw,['model']] = '3er'
vw80 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] < 1990) & (df['brand'] == 'volkswagen') & (df['model'].isna())
df.loc[vw80,['model']] = 'passat'
opel82 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] == 1982) & (df['brand'] == 'opel') & (df['model'].isna())
df.loc[opel82,['model']] = 'kadett'
other82 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] < 1988) & (df['brand'] != 'sonstige_autos') & (df['model'].isna())
df.loc[other82,['model']] = 'other'
volvo89 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] == 1989) & (df['brand'] == 'volvo') & (df['model'].isna())
df.loc[volvo89,['model']] = 'other'
audi100 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] == 1990) & (df['brand'] == 'audi') & (df['model'].isna())
df.loc[audi100,['model']] = '100'
# HP specific
freelander = (df['brand'] == 'land_rover') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([111,115,60,129,140,109,])) & (df['registrationyear'] > 1992) & (df['registrationyear'] < 2006) & (df['model'].isna())
df.loc[freelander,['model']] = 'freelander'
ypsilon = (df['brand'] == 'lancia') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([44,70,74,75,602,1200])) & (df['model'].isna())
df.loc[ypsilon,['model']] = 'ypsilon'
logan = (df['brand'] == 'dacia') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([75,84,85,105])) & (df['registrationyear'].isin([2009,2012,2013,2015])) & (df['model'].isna())
df.loc[logan,['model']] = 'logan'
porscheother = (df['brand'] == 'porsche') & (df['vehicletype'] == 'coupe') & (df['power'].isin([125,160])) & (df['registrationyear'].isin([1981,1989])) & (df['model'].isna())
df.loc[porscheother,['model']] = 'other'
justy = (df['brand'] == 'subaru') & (df['vehicletype'] == 'small') & (df['power'].isin([25,34,50,60,68])) & (df['registrationyear'].isin([1996,1997,2000])) & (df['model'].isna())
df.loc[justy,['model']] = 'justy'
otherrover = (df['brand'] == 'rover') & (df['vehicletype'] == 'sedan') & (df['power'].isin([75,100,111,120,150,16,77,85,105,108,116,130,174])) & (df['registrationyear'].isin([1996,1997,1998,1999,2000,2001,2002,2003])) & (df['model'].isna())
df.loc[otherrover,['model']] = 'other'
chryslerother = (df['brand'] == 'chrysler') & (df['vehicletype'] == 'sedan') & (df['power'].isin([133,254,250,85,100,109,122,137,186])) & (df['registrationyear'].isin([1952,1977,1996,1998,1999,2000,2002,2008,2010])) & (df['model'].isna())
df.loc[chryslerother,['model']] = 'other'
voyager = (df['brand'] == 'chrysler') & (df['vehicletype'] == 'bus') & (df['power'].isin([151])) & (df['registrationyear'].isin([1996,1997,1999])) & (df['model'].isna())
df.loc[voyager,['model']] = 'voyager'
t601 = (df['brand'] == 'trabant') & (df['vehicletype'] == 'sedan') & (df['power'].isin([26,45])) & (df['registrationyear'].isin([1982,1988,1989,1977,1986,1984,1998])) & (df['model'].isna())
df.loc[t601,['model']] = '601'
six = (df['brand'] == 'trabant') & (df['vehicletype'].isin(['small','coupe'])) & (df['power'].isin([60,26,75])) & (df['registrationyear'].isin([1988,1998,2004,2008])) & (df['model'].isna())
df.loc[six,['model']] = '601'
otherchevy = (df['brand'] == 'chevrolet') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([64,141,75,94,95,54,109,125,195,163,130,105,124,72,69,60,360])) & (df['registrationyear'].isin([2011,2005,1968,1978,2000,2006,2010,2012])) & (df['model'].isna())
df.loc[otherchevy,['model']] = 'other'
volvoother = (df['brand'] == 'volvo') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([115,131,52,105,113,116])) & (df['registrationyear'].isin([1996,1991,1993,2007,1985,1988,1998,1999,2004,2012])) & (df['model'].isna())
df.loc[volvoother,['model']] = 'other'
kother = (df['brand'] == 'kia') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([105,138,140,48,101,113,133,143,203])) & (df['registrationyear'].isin([2005,2007,2001,2002,2003,2004])) & (df['model'].isna())
df.loc[kother,['model']] = 'other'
rio = (df['brand'] == 'kia') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([97,109,83,98,105,125,138,139,150])) & (df['registrationyear'].isin([2003,2000,2007,1999,2001,2002])) & (df['model'].isna())
df.loc[rio,['model']] = 'rio'
sorento = (df['brand'] == 'kia') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([140,78,110,133,194])) & (df['registrationyear'].isin([2006,2001,2004,1995,1999,2012])) & (df['model'].isna())
df.loc[sorento,['model']] = 'sorento'
civic = (df['brand'] == 'honda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([90,124,125])) & (df['registrationyear'].isin([1992,1991,1993])) & (df['model'].isna())
df.loc[civic,['model']] = 'civic'
jazz = (df['brand'] == 'honda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([90])) & (df['registrationyear'].isin([2010])) & (df['model'].isna())
df.loc[jazz,['model']] = 'jazz'
hother = (df['brand'] == 'honda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([65])) & (df['registrationyear'].isin([1999])) & (df['model'].isna())
df.loc[hother,['model']] = 'other'
civcou = (df['brand'] == 'honda') & (df['vehicletype'].isin(['coupe'])) & (df['power'].isin([90,114,100,105,107,109])) & (df['registrationyear'].isin([2000,1995,1996,1998,1989,1999,2006])) & (df['model'].isna())
df.loc[civcou,['model']] = 'civic'
honother = (df['brand'] == 'honda') & (df['vehicletype'].isin(['coupe'])) & (df['power'].isin([133,185])) & (df['registrationyear'].isin([2000,1992,1998])) & (df['model'].isna())
cother = (df['brand'] == 'honda') & (df['vehicletype'].isin(['coupe'])) & (df['power'].isin([133])) & (df['registrationyear'].isin([2000,1992,1998,])) & (df['model'].isna())
df.loc[cother,['model']] = 'other'
jbus = (df['brand'] == 'honda') & (df['vehicletype'].isin(['bus'])) & (df['power'].isin([90])) & (df['registrationyear'].isin([2010,2012,2013])) & (df['model'].isna())
df.loc[jbus,['model']] = 'jazz'
octavia = (df['brand'] == 'skoda') & (df['price'] > 2099) & (df['price'] < 5701) & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([102,105,150])) & (df['registrationyear'].isin([2001,2005,2007,2008])) & (df['model'].isna())
df.loc[octavia,['model']] = 'octavia'
swift = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([53,50,55,58,92])) & (df['registrationyear'].isin([1997,2000,1998,2003,2008])) & (df['model'].isna())
df.loc[swift,['model']] = 'swift'
suzother = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([63,52,65,76,83,84,57,96])) & (df['registrationyear'].isin([1990, 1995,1996,1999,2002,1997,2001,2004,2007])) & (df['model'].isna())
df.loc[suzother,['model']] = 'other'
ukiother = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([2009,2011,2012])) & (df['model'].isna())
df.loc[ukiother,['model']] = 'other'
jimny = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([86,82,88])) & (df['registrationyear'].isin([2001,2005,2003])) & (df['model'].isna())
df.loc[jimny,['model']] = 'jimny'
zother = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([97,45,68,75,85,98,136,170])) & (df['registrationyear'].isin([1995,1996,1988,1992,1998,2006,2007])) & (df['model'].isna())
df.loc[zother,['model']] = 'other'
carisma = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([125,115])) & (df['registrationyear'].isin([2002,1995,1998,1997,2000,2003])) & (df['model'].isna())
df.loc[carisma,['model']] = 'carisma'
colt = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([75,90,95])) & (df['registrationyear'].isin([2002,2009,2000,2006])) & (df['model'].isna())
df.loc[colt,['model']] = 'colt'
coltt = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75,70,95,82,150])) & (df['registrationyear'].isin([1999,1996,1998,2006,2009,1997,2000,2002,2010,2012,2001])) & (df['model'].isna())
df.loc[coltt,['model']] = 'colt'
lancer = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([98])) & (df['registrationyear'].isin([1999,2000,2001,2003,2004,1997,2007])) & (df['model'].isna())
df.loc[lancer,['model']] = 'lancer'
galant = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([160,165])) & (df['registrationyear'].isin([1999,2000,2001,2003,2004,1997,2007])) & (df['model'].isna())
df.loc[galant,['model']] = 'galant'
wother = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([82,86,83,101,132,125])) & (df['registrationyear'].isin([1999,2000,2001,2003,2004])) & (df['model'].isna())
df.loc[wother,['model']] = 'other'
yaris = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75,86,90,87])) & (df['registrationyear'].isin([2008,2000,2001,2002])) & (df['model'].isna())
df.loc[yaris,['model']] = 'yaris'
aygo = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([2008,2006,2009])) & (df['model'].isna())
df.loc[aygo,['model']] = 'aygo'
yar = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([1999,2001])) & (df['model'].isna())
df.loc[yar,['model']] = 'yaris'
cor = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([97])) & (df['registrationyear'].isin([2003,2000,2001])) & (df['model'].isna())
df.loc[cor,['model']] = 'corolla'
corolla = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([86])) & (df['registrationyear'].isin([1998])) & (df['model'].isna())
df.loc[corolla,['model']] = 'corolla'
sixty = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([1997])) & (df['model'].isna())
df.loc[sixty,['model']] = 'other'
tother = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([1993,1995,1997])) & (df['model'].isna())
df.loc[tother,['model']] = 'other'
coro = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([75,88,90,110])) & (df['registrationyear'].isin([1993,2006,1995,2008])) & (df['model'].isna())
df.loc[coro,['model']] = 'corolla'
auris = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([177,124,126])) & (df['registrationyear'].isin([2007,2010])) & (df['model'].isna())
df.loc[auris,['model']] = 'auris'
llo = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([72,97,105])) & (df['registrationyear'].isin([1992,2003])) & (df['model'].isna())
df.loc[llo,['model']] = 'corolla'
avensis = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([177])) & (df['model'].isna())
df.loc[avensis,['model']] = 'avensis'
sedoy = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([63,91,180])) & (df['registrationyear'].isin([1993,1998,2009])) & (df['model'].isna())
df.loc[sedoy,['model']] = 'other'
yar = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([1999])) & (df['model'].isna())
df.loc[yar,['model']] = 'yaris'
micra = (df['brand'] == 'nissan') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([54,65,50,55,40])) & (df['registrationyear'].isin([1994,2009,1998,1995,1999,2000,2004,1991,1996,1997,2008,2013])) & (df['model'].isna())
df.loc[micra,['model']] = 'micra'
micraa = (df['brand'] == 'nissan') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([65,80])) & (df['registrationyear'].isin([2003,2014])) & (df['model'].isna())
df.loc[micraa,['model']] = 'micra'
micraaa = (df['brand'] == 'nissan') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([98])) & (df['registrationyear'].isin([2013])) & (df['model'].isna())
df.loc[micraaa,['model']] = 'micra'
qashqai = (df['brand'] == 'nissan') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2011])) & (df['model'].isna())
df.loc[qashqai,['model']] = 'qashqai'
ibiza = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75,64,86,69,70,85])) & (df['registrationyear'].isin([2002,2001,2003,2011,2007])) & (df['model'].isna())
df.loc[ibiza,['model']] = 'ibiza'
arosa = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([50])) & (df['registrationyear'].isin([1999,2002,1998,2000,2001,1997])) & (df['model'].isna())
df.loc[arosa,['model']] = 'arosa'
ibizaa = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([86,101,69])) & (df['registrationyear'].isin([2006,2012,2013])) & (df['model'].isna())
df.loc[ibizaa,['model']] = 'ibiza'
ibiza1 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([200,51])) & (df['registrationyear'].isin([2009])) & (df['model'].isna())
df.loc[ibiza1,['model']] = 'ibiza'
other1 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([25])) & (df['model'].isna())
df.loc[other1,['model']] = 'other'
cordoba75 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([1996,1998])) & (df['model'].isna())
df.loc[cordoba75,['model']] = 'cordoba'
leon07 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([105])) & (df['registrationyear'].isin([2007])) & (df['model'].isna())
df.loc[leon07,['model']] = 'leon'
leon160 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([102,160,265])) & (df['registrationyear'].isin([2007,2008,2009,2012])) & (df['model'].isna())
df.loc[leon160,['model']] = 'leon'
toledo = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([101,150])) & (df['registrationyear'].isin([1998,1999])) & (df['model'].isna())
df.loc[toledo,['model']] = 'toledo'
leon140 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([140])) & (df['registrationyear'].isin([2006])) & (df['model'].isna())
df.loc[leon140,['model']] = 'leon'
toledo150 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2000])) & (df['model'].isna())
df.loc[toledo150,['model']] = 'toledo'
ibiza09 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([86])) & (df['registrationyear'].isin([2009])) & (df['model'].isna())
df.loc[ibiza09,['model']] = 'ibiza'
ibiza07 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([64])) & (df['model'].isna())
df.loc[ibiza07,['model']] = 'ibiza'
getz = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([55,82,88,97])) & (df['registrationyear'].isin([2003,2007,2002])) & (df['model'].isna())
df.loc[getz,['model']] = 'getz'
i_reihe = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68,77,78])) & (df['registrationyear'].isin([2010,2009,2011,2007])) & (df['model'].isna())
df.loc[i_reihe,['model']] = 'i_reihe'
getz03 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([55,63,67,65,90])) & (df['registrationyear'].isin([2003])) & (df['model'].isna())
df.loc[getz03,['model']] = 'getz'
yother = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([58,54,55,60,75,40])) & (df['registrationyear'].isin([1998,1999,1996,2000,2001,2002])) & (df['model'].isna())
df.loc[yother,['model']] = 'other'
yot = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[yot,['model']] = 'other'
ir = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([67])) & (df['registrationyear'].isin([2010])) & (df['model'].isna())
df.loc[ir,['model']] = 'i_reihe'
other58 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([58])) & (df['model'].isna())
df.loc[other58,['model']] = 'other'
i = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([63,65,79,90])) & (df['registrationyear'].isin([2011])) & (df['model'].isna())
df.loc[i,['model']] = 'i_reihe'
rei = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([109,90,78])) & (df['registrationyear'].isin([2009,2010,2011])) & (df['model'].isna())
df.loc[rei,['model']] = 'i_reihe'
other99 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([82,140,160,235])) & (df['registrationyear'].isin([1999,2003,2005,2006])) & (df['model'].isna())
df.loc[other99,['model']] = 'other'
other94 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([75,85,86,131,136,54])) & (df['registrationyear'].isin([1994,2000,2001,2002,2005])) & (df['model'].isna())
df.loc[other94,['model']] = 'other'
santa = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([145,155,170])) & (df['registrationyear'].isin([2003,2002,2004,2006,2008])) & (df['model'].isna())
df.loc[santa,['model']] = 'santa'
he = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([163,140,184])) & (df['registrationyear'].isin([2010,2013])) & (df['model'].isna())
df.loc[he,['model']] = 'i_reihe'
shother = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([163,99])) & (df['registrationyear'].isin([2006,1998,2000,2005])) & (df['model'].isna())
df.loc[shother,['model']] = 'other'
santa140 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([140])) & (df['registrationyear'].isin([2003])) & (df['model'].isna())
df.loc[santa140,['model']] = 'santa'
other150 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2002,2003])) & (df['model'].isna())
df.loc[other150,['model']] = 'other'
santa06 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2006])) & (df['model'].isna())
df.loc[santa06,['model']] = 'santa'
c1 = (df['brand'] == 'citroen') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([2008,2011])) & (df['model'].isna())
df.loc[c1,['model']] = 'c1'
c3 = (df['brand'] == 'citroen') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([73])) & (df['registrationyear'].isin([2003])) & (df['model'].isna())
df.loc[c3,['model']] = 'c3'
othercit = (df['brand'] == 'citroen') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60,75])) & (df['registrationyear'].isin([2001,1999,2000,1998])) & (df['model'].isna())
df.loc[othercit,['model']] = 'other'
fortwo = (df['brand'] == 'smart') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([61,45,54,41,55,71,40,50,72])) & (df['registrationyear'].isin([2005,2002,1999,2001,2000,2012,2004,2003,2008,1998,2007,2011,2009,2014])) & (df['model'].isna())
df.loc[fortwo,['model']] = 'fortwo'
forfour = (df['brand'] == 'smart') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([109])) & (df['registrationyear'].isin([2006])) & (df['model'].isna())
df.loc[forfour,['model']] = 'forfour'
ftvert = (df['brand'] == 'smart') & (df['vehicletype'].isin(['convertible'])) & (df['power'].isin([54,41,45])) & (df['registrationyear'].isin([2000,2001,2005,2006,2008])) & (df['model'].isna())
df.loc[ftvert,['model']] = 'fortwo'
vertft = (df['brand'] == 'smart') & (df['vehicletype'].isin(['convertible'])) & (df['power'].isin([55,61])) & (df['registrationyear'].isin([2000,2001,2002])) & (df['model'].isna())
df.loc[vertft,['model']] = 'fortwo'
ft = (df['brand'] == 'smart') & (df['vehicletype'].isin(['convertible'])) & (df['power'].isin([84])) & (df['registrationyear'].isin([2009])) & (df['model'].isna())
df.loc[ft,['model']] = 'fortwo'
sixre = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([115,116,166,141,120])) & (df['registrationyear'].isin([1999,2003])) & (df['model'].isna())
df.loc[sixre,['model']] = '6_reihe'
sre = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([90])) & (df['registrationyear'].isin([1997,1998,1996,2000,1990])) & (df['model'].isna())
df.loc[sre,['model']] = '6_reihe'
mazother = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([144,163,109])) & (df['registrationyear'].isin([1997,2001,1993])) & (df['model'].isna())
df.loc[mazother,['model']] = 'other'
three = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([98])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[three,['model']] = '3_reihe'
three88 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([88])) & (df['registrationyear'].isin([1997,1995,1998,1996])) & (df['model'].isna())
df.loc[three88,['model']] = '3_reihe'
rh6 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([166,141,163])) & (df['registrationyear'].isin([2002,2010])) & (df['model'].isna())
df.loc[rh6,['model']] = '6_reihe'
thei = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([105,73])) & (df['registrationyear'].isin([1996,2006,1997,2005,2008])) & (df['model'].isna())
df.loc[thei,['model']] = '3_reihe'
ihth = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([144,98,114,150,109,86,75])) & (df['registrationyear'].isin([1999,1995,2003,2006,2000,2010])) & (df['model'].isna())
df.loc[ihth,['model']] = '3_reihe'
eeh = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([73])) & (df['registrationyear'].isin([1997,1996,2000])) & (df['model'].isna())
df.loc[eeh,['model']] = '3_reihe'
hee = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([1997,1999])) & (df['model'].isna())
df.loc[hee,['model']] = '3_reihe'
ri3 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([88,98,65,109])) & (df['registrationyear'].isin([1998,1999,1996,2002,2003,2006,2008])) & (df['model'].isna())
df.loc[ri3,['model']] = '3_reihe'
reihe373 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([73])) & (df['registrationyear'].isin([1998])) & (df['model'].isna())
df.loc[reihe373,['model']] = '3_reihe'
other7509 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([2009,2013])) & (df['model'].isna())
df.loc[other7509,['model']] = 'other'
reihe1 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([1995])) & (df['model'].isna())
df.loc[reihe1,['model']] = '1_reihe'
punto60 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([2000.0, 2001.0, 2002.0, 2003.0, 1999.0, 1998.0,
1997.0, 1996.0, 1993.0, 1994.0])) & (df['model'].isna())
df.loc[punto60,['model']] = 'punto'
panda60 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([2010.0, 2008.0, 2011.0, 1991.0])) & (df['model'].isna())
df.loc[panda60,['model']] = 'panda'
seicento60 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([55])) & (df['registrationyear'].isin([2000,2001])) & (df['model'].isna())
df.loc[seicento60,['model']] = 'seicento'
punto65 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([65])) & (df['registrationyear'].isin([2010.0, 2000.0, 1999.0,
1996.0, 1998.0, 2001.0, 2003.0, 2004.0])) & (df['model'].isna())
df.loc[punto65,['model']] = 'punto'
punto01 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60, 80, 44, 75, 90, 65, 85, 64, 68, 86])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[punto01,['model']] = 'punto'
seicento01 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([55,50])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[seicento01,['model']] = 'seicento'
stilo170 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([170])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[stilo170,['model']] = 'stilo'
other101 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([101])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[other101,['model']] = 'other'
punto98 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60, 86, 65, 75, 44])) & (df['registrationyear'].isin([1998])) & (df['model'].isna())
df.loc[punto98,['model']] = 'punto'
five69 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([69])) & (df['registrationyear'].isin([2008.0, 2009.0, 2010.0, 2013.0])) & (df['model'].isna())
df.loc[five69,['model']] = '500'
puntorand = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([80, 86, 85, 69, 64])) & (df['registrationyear'].isin([1999,2003,2000])) & (df['model'].isna())
df.loc[puntorand,['model']] = 'punto'
stilo103 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([103, 80, 170, 115, 102])) & (df['registrationyear'].isin([2002.0, 2003.0, 2004.0, 2005.0])) & (df['model'].isna())
df.loc[stilo103,['model']] = 'stilo'
bravo150 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2007.0, 2008.0])) & (df['model'].isna())
df.loc[bravo150,['model']] = 'bravo'
bravo08 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([2008.0])) & (df['model'].isna())
df.loc[bravo08,['model']] = 'bravo'
punto60 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([2003,2000])) & (df['model'].isna())
df.loc[punto60,['model']] = 'punto'
re2 = (df['brand'] == 'peugeot') & (df['vehicletype'].isin(['small']))& (df['power'].isin([60])) & (df['registrationyear'].isin([2004.0, 2005.0,
2011.0, 2010.0, 1990.0])) & (df['model'].isna())
df.loc[re2,['model']] = '2_reihe'
twore = (df['brand'] == 'peugeot') & (df['vehicletype'].isin(['convertible']))& (df['power'].isin([120,109])) & (df['registrationyear'].isin([2003.0, 2002.0, 2004.0, 2005.0, 2011.0, 2012.0])) & (df['model'].isna())
df.loc[twore,['model']] = '2_reihe'
fiestarand = (df['brand'] == 'ford') & (df['vehicletype'].isin(['small']))& (df['power'].isin([82, 150,182, 61,81])) & (df['registrationyear'].isin([2006.0, 2009.0, 2014.0, 2000.0, 2005.0])) & (df['model'].isna())
df.loc[fiestarand,['model']] = 'fiesta'
fiestaa = (df['brand'] == 'ford') & (df['vehicletype'].isin(['small']))& (df['power'].isin([60])) & (df['registrationyear'].isin([1992.0, 2010.0])) & (df['model'].isna())
df.loc[fiestaa,['model']] = 'fiesta'
fiestaaa = (df['brand'] == 'ford') & (df['vehicletype'].isin(['small']))& (df['power'].isin([50, 75, 54, 66, 103])) & (df['registrationyear'].isin([2000])) & (df['model'].isna())
df.loc[fiestaaa,['model']] = 'fiesta'
del audi75,opelastra,mini75,nissan60,ibiza03,skoda60,lancia60,smart60,bmw101,mit99,mitbus00,honda101
del golf,golf02,golf98,golf09,golf99,golf04,golf91
del polo,polo98,passat97,beetle,jetta,phaeton,phaeton05
del brand_power,brand_power1,brand_power2,brand_power3,brand_power4,brand_power5,brand_power6
del top5_brand_power,over1000_brand_power,under1000_brand_power,top5_brands ,top5_brands2,top5_brands3,top5_brands4,top5_brands5,top5_brands6
del middle_brands,middle_brands2,middle_brands3,middle_brands4,middle_brands5,middle_brands6
del lower_brands,lower_brands2,lower_brands3,lower_brands4,lower_brands5,lower_brands6
del top5,top52,top53,top54,top55,top56,middle,middle2,middle3,middle4,middle5,middle6,lower,lower2,lower3,lower4,lower5,lower6
del topbrand_vt, vt_power, vwvt
del trabant,bmw,vw80,opel82,other82,volvo89,audi100,freelander,ypsilon,logan,porscheother,justy,otherrover,chryslerother,voyager
del t601,six,otherchevy,volvoother,kother,rio,sorento,civic,jazz,hother,civcou,honother,cother,jbus,octavia,swift,suzother
del ukiother,jimny,zother,carisma,colt,coltt, lancer, galant, wother, yaris, aygo, yar, cor, corolla, sixty, tother, coro, auris
del llo,avensis,sedoy,micra,micraa,micraaa,ibiza,arosa,ibizaa,ibiza1,other1,cordoba75,leon07,leon160,toledo,leon140,toledo150
del ibiza09,ibiza07,getz,i_reihe,getz03,yother,yot,ir,other58,i,rei,other99,other94,santa,he,shother,santa140,other150,santa06
del c1,c3,othercit,fortwo,forfour,ftvert,vertft,sixre,sre,mazother,three,three88,rh6,thei,ihth,eeh,hee,ri3,reihe373,other7509
del reihe1,punto60,panda60,seicento60,punto65,punto01,seicento01,stilo170,other101,punto98,five69,puntorand,stilo103,bravo150
del bravo08,re2,twore,fiestarand,fiestaa,fiestaaa
gc.collect()
899
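The masks above all follow one pattern: match a brand, optionally restrict vehicle type, power and registration year, require a missing model, then assign a label. A minimal sketch of the same idea driven by a rule table is shown below. The names `RULES` and `apply_model_rules` are hypothetical, and the two rules and the demo frame are shortened illustrations, not the notebook's full rule set.

```python
import pandas as pd

# Hypothetical rule table: each dict mirrors one hand-written mask.
# Keys other than "brand" and "model" are optional filters.
RULES = [
    {"brand": "lancia", "vehicletype": ["small"], "power": [44, 70], "model": "ypsilon"},
    {"brand": "dacia", "vehicletype": ["wagon"], "power": [75, 84],
     "registrationyear": [2009, 2012], "model": "logan"},
]

def apply_model_rules(frame, rules):
    """Fill missing 'model' values in place, applying the rules in order."""
    for rule in rules:
        # Only rows of the right brand whose model is still missing are eligible.
        mask = (frame["brand"] == rule["brand"]) & frame["model"].isna()
        for col in ("vehicletype", "power", "registrationyear"):
            if col in rule:
                mask = mask & frame[col].isin(rule[col])
        frame.loc[mask, "model"] = rule["model"]
    return frame

# Tiny illustrative frame (not taken from the real dataset).
demo = pd.DataFrame({
    "brand": ["lancia", "lancia", "dacia", "dacia"],
    "vehicletype": ["small", "small", "wagon", "wagon"],
    "power": [70, 70, 75, 75],
    "registrationyear": [2001, 2001, 2009, 1999],
    "model": [None, "delta", None, None],
})
apply_model_rules(demo, RULES)
print(demo)
```

Because the masks live only inside the loop, this variant would not need the long `del` lists afterwards. Rule order still matters in the same way as in the cells above: an earlier rule fills a row and later rules no longer see it as missing.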
def analyze_missing_models(df, brand):
    # Focus on the brand
    brand_df = df[df['brand'] == brand]
    # Step 1: Check which vehicle types are most common for missing models
    vt_counts = brand_df[brand_df['model'].isna()]['vehicletype'].value_counts()
    print(f"\n--- {brand.upper()} ---")
    print("Vehicle types with missing models:")
    print(vt_counts)
    # Step 2: For each vehicle type, show power distribution
    for vt in vt_counts.index:
        subset = brand_df[(brand_df['model'].isna()) & (brand_df['vehicletype'] == vt)]
        pw_counts = subset['power'].value_counts()
        print(f"\n{vt}: Power distribution for missing models")
        print(pw_counts)
        print(pw_counts.index)
        # Step 3: Show registration year distribution
        reg_counts = subset['registrationyear'].value_counts()
        print(f"\n{vt}: Registration year distribution for missing models")
        print(reg_counts)
        print(reg_counts.index)
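`analyze_missing_models` only describes the rows whose model is missing. To turn those distributions into candidate labels, one option is to look at rows of the same brand where the model is known and find the most frequent model for each vehicle type and power combination. The sketch below does that. `suggest_models` is a hypothetical helper, not part of the original notebook, and it assumes that the known rows are representative of the missing ones.

```python
import pandas as pd

def suggest_models(frame, brand, keys=("vehicletype", "power")):
    """Most frequent known model per key combination for one brand."""
    known = frame[(frame["brand"] == brand) & frame["model"].notna()]
    # Count how often each model occurs within each key combination.
    counts = known.groupby(list(keys) + ["model"]).size().reset_index(name="n")
    # Put the most frequent model first, then keep one row per key combination.
    counts = counts.sort_values("n", ascending=False, kind="stable")
    return counts.drop_duplicates(subset=list(keys)).reset_index(drop=True)

# Tiny illustrative frame (not taken from the real dataset).
demo = pd.DataFrame({
    "brand": ["ford"] * 5,
    "vehicletype": ["small", "small", "small", "wagon", "small"],
    "power": [60, 60, 60, 115, 60],
    "model": ["fiesta", "fiesta", "ka", "focus", None],
})
suggestions = suggest_models(demo, "ford")
print(suggestions)
```

The column `n` shows how many known rows support each suggestion, so weakly supported combinations can be left unfilled or labelled `'other'`.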
analyze_missing_models(df, 'ford')
--- FORD ---
Vehicle types with missing models:
small 220
wagon 136
sedan 94
bus 57
coupe 51
suv 33
other 21
convertible 20
Name: vehicletype, dtype: int64
small: Power distribution for missing models
0 70
60 60
50 20
75 18
45 5
90 5
80 5
70 5
44 4
55 4
65 3
68 3
100 2
95 2
101 2
116 2
96 1
118 1
110 1
71 1
74 1
69 1
67 1
63 1
59 1
173 1
Name: power, dtype: int64
Int64Index([ 0, 60, 50, 75, 45, 90, 80, 70, 44, 55, 65, 68, 100,
95, 101, 116, 96, 118, 110, 71, 74, 69, 67, 63, 59, 173],
dtype='int64')
small: Registration year distribution for missing models
1999.0 30
1998.0 26
2000.0 23
1997.0 23
2002.0 22
2001.0 20
1996.0 12
2004.0 12
2003.0 11
2005.0 10
1990.0 6
2006.0 5
1995.0 3
2007.0 3
1992.0 3
1989.0 2
2014.0 1
2008.0 1
2012.0 1
1994.0 1
1978.0 1
2009.0 1
2010.0 1
2011.0 1
1993.0 1
Name: registrationyear, dtype: int64
Float64Index([1999.0, 1998.0, 2000.0, 1997.0, 2002.0, 2001.0, 1996.0, 2004.0,
2003.0, 2005.0, 1990.0, 2006.0, 1995.0, 2007.0, 1992.0, 1989.0,
2014.0, 2008.0, 2012.0, 1994.0, 1978.0, 2009.0, 2010.0, 2011.0,
1993.0],
dtype='float64')
wagon: Power distribution for missing models
0 36
115 15
116 11
131 9
90 9
101 9
100 6
125 6
109 5
75 4
120 4
170 3
110 3
105 3
136 2
150 2
117 2
60 1
128 1
130 1
89 1
140 1
145 1
155 1
Name: power, dtype: int64
Int64Index([ 0, 115, 116, 131, 90, 101, 100, 125, 109, 75, 120, 170, 110,
105, 136, 150, 117, 60, 128, 130, 89, 140, 145, 155],
dtype='int64')
wagon: Registration year distribution for missing models
2002.0 23
2001.0 19
2000.0 14
1999.0 13
1998.0 13
2005.0 12
2003.0 10
2004.0 9
2006.0 5
1997.0 4
1996.0 4
2007.0 3
1995.0 2
2008.0 2
1990.0 1
2009.0 1
2011.0 1
Name: registrationyear, dtype: int64
Float64Index([2002.0, 2001.0, 2000.0, 1999.0, 1998.0, 2005.0, 2003.0, 2004.0,
2006.0, 1997.0, 1996.0, 2007.0, 1995.0, 2008.0, 1990.0, 2009.0,
2011.0],
dtype='float64')
sedan: Power distribution for missing models
0 19
90 7
75 7
100 7
116 5
60 5
136 4
115 4
101 3
110 3
50 2
226 2
77 2
95 2
230 1
105 1
1002 1
109 1
1120 1
94 1
80 1
89 1
85 1
130 1
66 1
55 1
205 1
170 1
38 1
29 1
148 1
147 1
146 1
145 1
131 1
120 1
Name: power, dtype: int64
Int64Index([ 0, 90, 75, 100, 116, 60, 136, 115, 101, 110, 50,
226, 77, 95, 230, 105, 1002, 109, 1120, 94, 80, 89,
85, 130, 66, 55, 205, 170, 38, 29, 148, 147, 146,
145, 131, 120],
dtype='int64')
sedan: Registration year distribution for missing models
1999.0 11
1998.0 11
2000.0 9
1995.0 7
1997.0 7
2005.0 7
2001.0 6
1996.0 5
2002.0 5
2006.0 4
2009.0 3
1993.0 2
1990.0 2
1989.0 2
1976.0 1
2013.0 1
1960.0 1
1940.0 1
1970.0 1
2007.0 1
1994.0 1
1977.0 1
2004.0 1
1988.0 1
1978.0 1
1979.0 1
1967.0 1
Name: registrationyear, dtype: int64
Float64Index([1999.0, 1998.0, 2000.0, 1995.0, 1997.0, 2005.0, 2001.0, 1996.0,
2002.0, 2006.0, 2009.0, 1993.0, 1990.0, 1989.0, 1976.0, 2013.0,
1960.0, 1940.0, 1970.0, 2007.0, 1994.0, 1977.0, 2004.0, 1988.0,
1978.0, 1979.0, 1967.0],
dtype='float64')
bus: Power distribution for missing models
0 18
116 9
125 4
90 3
140 2
131 2
80 2
145 2
115 2
75 2
135 1
101 1
175 1
110 1
128 1
100 1
98 1
130 1
146 1
147 1
211 1
Name: power, dtype: int64
Int64Index([ 0, 116, 125, 90, 140, 131, 80, 145, 115, 75, 135, 101, 175,
110, 128, 100, 98, 130, 146, 147, 211],
dtype='int64')
bus: Registration year distribution for missing models
2005.0 9
2001.0 7
1998.0 6
1999.0 5
2009.0 4
2000.0 4
2003.0 4
2006.0 3
1996.0 3
1995.0 3
2008.0 2
2007.0 2
1997.0 2
1993.0 1
1992.0 1
2004.0 1
Name: registrationyear, dtype: int64
Float64Index([2005.0, 2001.0, 1998.0, 1999.0, 2009.0, 2000.0, 2003.0, 2006.0,
1996.0, 1995.0, 2008.0, 2007.0, 1997.0, 1993.0, 1992.0, 2004.0],
dtype='float64')
coupe: Power distribution for missing models
0 13
170 6
130 5
131 4
125 3
90 2
147 2
163 2
132 1
122 1
120 1
116 1
115 1
109 1
69 1
101 1
136 1
179 1
145 1
140 1
138 1
100 1
Name: power, dtype: int64
Int64Index([ 0, 170, 130, 131, 125, 90, 147, 163, 132, 122, 120, 116, 115,
109, 69, 101, 136, 179, 145, 140, 138, 100],
dtype='int64')
coupe: Registration year distribution for missing models
2000.0 12
2002.0 7
1999.0 7
1998.0 5
2001.0 4
1994.0 2
1992.0 2
1995.0 2
2006.0 2
2007.0 1
2009.0 1
1996.0 1
1980.0 1
1978.0 1
1997.0 1
1979.0 1
1975.0 1
Name: registrationyear, dtype: int64
Float64Index([2000.0, 2002.0, 1999.0, 1998.0, 2001.0, 1994.0, 1992.0, 1995.0,
2006.0, 2007.0, 2009.0, 1996.0, 1980.0, 1978.0, 1997.0, 1979.0,
1975.0],
dtype='float64')
suv: Power distribution for missing models
124 8
0 7
136 3
150 2
165 2
125 2
196 1
203 1
140 1
207 1
340 1
156 1
163 1
109 1
305 1
Name: power, dtype: int64
Int64Index([124, 0, 136, 150, 165, 125, 196, 203, 140, 207, 340, 156, 163, 109,
305],
dtype='int64')
suv: Registration year distribution for missing models
1994.0 9
2008.0 2
2003.0 2
1996.0 2
1998.0 2
2004.0 2
2002.0 1
1977.0 1
2009.0 1
2012.0 1
2013.0 1
2000.0 1
2001.0 1
1987.0 1
2010.0 1
2006.0 1
1989.0 1
2011.0 1
1997.0 1
1995.0 1
Name: registrationyear, dtype: int64
Float64Index([1994.0, 2008.0, 2003.0, 1996.0, 1998.0, 2004.0, 2002.0, 1977.0,
2009.0, 2012.0, 2013.0, 2000.0, 2001.0, 1987.0, 2010.0, 2006.0,
1989.0, 2011.0, 1997.0, 1995.0],
dtype='float64')
other: Power distribution for missing models
0 6
101 2
115 2
60 2
157 2
226 1
70 1
80 1
109 1
175 1
240 1
205 1
Name: power, dtype: int64
Int64Index([0, 101, 115, 60, 157, 226, 70, 80, 109, 175, 240, 205], dtype='int64')
other: Registration year distribution for missing models
2008.0 3
1993.0 2
1984.0 2
1998.0 2
2000.0 2
2005.0 2
1964.0 1
1959.0 1
2001.0 1
1953.0 1
2006.0 1
1997.0 1
1999.0 1
1996.0 1
Name: registrationyear, dtype: int64
Float64Index([2008.0, 1993.0, 1984.0, 1998.0, 2000.0, 2005.0, 1964.0, 1959.0,
2001.0, 1953.0, 2006.0, 1997.0, 1999.0, 1996.0],
dtype='float64')
convertible: Power distribution for missing models
105 5
90 4
0 3
95 3
70 1
145 1
115 1
116 1
190 1
Name: power, dtype: int64
Int64Index([105, 90, 0, 95, 70, 145, 115, 116, 190], dtype='int64')
convertible: Registration year distribution for missing models
2004.0 4
1992.0 3
1995.0 2
1996.0 2
1994.0 1
2008.0 1
2003.0 1
1988.0 1
1997.0 1
1989.0 1
1999.0 1
1993.0 1
1991.0 1
Name: registrationyear, dtype: int64
Float64Index([2004.0, 1992.0, 1995.0, 1996.0, 1994.0, 2008.0, 2003.0, 1988.0,
1997.0, 1989.0, 1999.0, 1993.0, 1991.0],
dtype='float64')
analyze_missing_models(df, 'mercedes_benz')
--- MERCEDES_BENZ ---
Vehicle types with missing models:
sedan 388
wagon 157
coupe 84
small 45
bus 45
convertible 29
suv 22
other 21
Name: vehicletype, dtype: int64
sedan: Power distribution for missing models
0 100
122 31
136 23
170 19
143 16
...
166 1
54 1
172 1
174 1
10912 1
Name: power, Length: 74, dtype: int64
Int64Index([ 0, 122, 136, 170, 143, 150, 224, 109, 204,
163, 160, 75, 118, 177, 82, 125, 306, 193,
116, 184, 95, 132, 197, 218, 90, 102, 108,
87, 156, 220, 225, 65, 231, 129, 190, 72,
272, 234, 278, 320, 260, 341, 88, 300, 387,
292, 388, 192, 265, 161, 186, 16051, 86, 600,
94, 96, 103, 105, 106, 107, 110, 121, 123,
126, 130, 140, 142, 60, 52, 166, 54, 172,
174, 10912],
dtype='int64')
sedan: Registration year distribution for missing models
1999.0 27
2002.0 26
1996.0 25
2001.0 22
2000.0 22
1998.0 22
2003.0 20
1992.0 18
2006.0 15
1997.0 15
2005.0 14
1990.0 14
1991.0 13
2007.0 13
1989.0 13
1995.0 11
2008.0 11
1987.0 10
1993.0 9
1986.0 8
1994.0 7
2004.0 7
1982.0 7
1983.0 5
1981.0 5
1988.0 4
2010.0 4
1966.0 3
1974.0 2
1984.0 2
1985.0 2
2012.0 2
2009.0 2
2011.0 1
1968.0 1
1967.0 1
1971.0 1
1976.0 1
1980.0 1
1969.0 1
1956.0 1
Name: registrationyear, dtype: int64
Float64Index([1999.0, 2002.0, 1996.0, 2001.0, 2000.0, 1998.0, 2003.0, 1992.0,
2006.0, 1997.0, 2005.0, 1990.0, 1991.0, 2007.0, 1989.0, 1995.0,
2008.0, 1987.0, 1993.0, 1986.0, 1994.0, 2004.0, 1982.0, 1983.0,
1981.0, 1988.0, 2010.0, 1966.0, 1974.0, 1984.0, 1985.0, 2012.0,
2009.0, 2011.0, 1968.0, 1967.0, 1971.0, 1976.0, 1980.0, 1969.0,
1956.0],
dtype='float64')
wagon: Power distribution for missing models
0 34
150 16
122 15
170 11
136 10
163 8
143 8
116 7
90 6
125 6
204 5
193 4
224 4
132 2
130 2
120 2
177 2
164 2
156 1
102 1
272 1
165 1
280 1
110 1
184 1
115 1
196 1
197 1
121 1
205 1
218 1
Name: power, dtype: int64
Int64Index([ 0, 150, 122, 170, 136, 163, 143, 116, 90, 125, 204, 193, 224,
132, 130, 120, 177, 164, 156, 102, 272, 165, 280, 110, 184, 115,
196, 197, 121, 205, 218],
dtype='int64')
wagon: Registration year distribution for missing models
1997.0 21
1998.0 16
2003.0 14
2002.0 13
1999.0 10
2008.0 10
2004.0 9
2001.0 8
2000.0 8
1996.0 7
2006.0 6
2005.0 5
1989.0 5
2010.0 4
1995.0 4
1993.0 4
1992.0 3
1994.0 3
1991.0 2
2007.0 2
2014.0 1
2012.0 1
2009.0 1
Name: registrationyear, dtype: int64
Float64Index([1997.0, 1998.0, 2003.0, 2002.0, 1999.0, 2008.0, 2004.0, 2001.0,
2000.0, 1996.0, 2006.0, 2005.0, 1989.0, 2010.0, 1995.0, 1993.0,
1992.0, 1994.0, 1991.0, 2007.0, 2014.0, 2012.0, 2009.0],
dtype='float64')
coupe: Power distribution for missing models
0 10
163 8
218 6
136 6
306 6
170 5
197 4
231 4
200 3
169 3
272 3
109 2
305 2
150 2
130 2
192 2
224 2
143 2
132 2
220 2
500 1
179 1
208 1
193 1
186 1
292 1
279 1
122 1
Name: power, dtype: int64
Int64Index([ 0, 163, 218, 136, 306, 170, 197, 231, 200, 169, 272, 109, 305,
150, 130, 192, 224, 143, 132, 220, 500, 179, 208, 193, 186, 292,
279, 122],
dtype='int64')
coupe: Registration year distribution for missing models
2002.0 18
2001.0 10
2000.0 7
2004.0 5
2006.0 4
1998.0 4
2005.0 4
1999.0 3
2003.0 3
2007.0 3
1982.0 3
1988.0 3
1997.0 3
2010.0 2
1991.0 2
1978.0 2
1992.0 1
1990.0 1
1984.0 1
1985.0 1
2009.0 1
1972.0 1
1995.0 1
2008.0 1
Name: registrationyear, dtype: int64
Float64Index([2002.0, 2001.0, 2000.0, 2004.0, 2006.0, 1998.0, 2005.0, 1999.0,
2003.0, 2007.0, 1982.0, 1988.0, 1997.0, 2010.0, 1991.0, 1978.0,
1992.0, 1990.0, 1984.0, 1985.0, 2009.0, 1972.0, 1995.0, 2008.0],
dtype='float64')
small: Power distribution for missing models
0 13
82 8
90 4
95 4
60 4
75 3
102 3
108 2
74 1
80 1
125 1
62 1
Name: power, dtype: int64
Int64Index([0, 82, 90, 95, 60, 75, 102, 108, 74, 80, 125, 62], dtype='int64')
small: Registration year distribution for missing models
2000.0 13
2001.0 7
1998.0 6
2003.0 5
2004.0 3
1999.0 3
2006.0 2
1997.0 2
2002.0 2
2008.0 1
1990.0 1
Name: registrationyear, dtype: int64
Float64Index([2000.0, 2001.0, 1998.0, 2003.0, 2004.0, 1999.0, 2006.0, 1997.0,
2002.0, 2008.0, 1990.0],
dtype='float64')
bus: Power distribution for missing models
0 15
122 5
150 4
129 4
70 2
130 1
100 1
55 1
116 1
110 1
109 1
103 1
95 1
224 1
90 1
85 1
82 1
140 1
200 1
156 1
Name: power, dtype: int64
Int64Index([ 0, 122, 150, 129, 70, 130, 100, 55, 116, 110, 109, 103, 95,
224, 90, 85, 82, 140, 200, 156],
dtype='int64')
bus: Registration year distribution for missing models
2001.0 7
2006.0 5
2002.0 5
2008.0 4
2007.0 3
2000.0 3
2005.0 3
2003.0 3
1994.0 2
1999.0 2
2004.0 2
2009.0 2
1984.0 1
2011.0 1
1996.0 1
1998.0 1
Name: registrationyear, dtype: int64
Float64Index([2001.0, 2006.0, 2002.0, 2008.0, 2007.0, 2000.0, 2005.0, 2003.0,
1994.0, 1999.0, 2004.0, 2009.0, 1984.0, 2011.0, 1996.0, 1998.0],
dtype='float64')
convertible: Power distribution for missing models
0 8
163 4
326 4
136 3
170 3
193 1
231 1
198 1
240 1
218 1
220 1
168 1
Name: power, dtype: int64
Int64Index([0, 163, 326, 136, 170, 193, 231, 198, 240, 218, 220, 168], dtype='int64')
convertible: Registration year distribution for missing models
2004.0 5
2001.0 3
1992.0 3
2000.0 3
1995.0 2
2002.0 2
2007.0 2
1984.0 1
1998.0 1
1993.0 1
1983.0 1
1968.0 1
1989.0 1
1960.0 1
2003.0 1
2005.0 1
Name: registrationyear, dtype: int64
Float64Index([2004.0, 2001.0, 1992.0, 2000.0, 1995.0, 2002.0, 2007.0, 1984.0,
1998.0, 1993.0, 1983.0, 1968.0, 1989.0, 1960.0, 2003.0, 2005.0],
dtype='float64')
suv: Power distribution for missing models
0 4
163 3
218 3
190 3
165 2
224 2
167 1
200 1
250 1
272 1
150 1
Name: power, dtype: int64
Int64Index([0, 163, 218, 190, 165, 224, 167, 200, 250, 272, 150], dtype='int64')
suv: Registration year distribution for missing models
2007.0 5
2000.0 5
2001.0 3
1998.0 3
2008.0 1
2003.0 1
2006.0 1
1989.0 1
2002.0 1
2005.0 1
Name: registrationyear, dtype: int64
Float64Index([2007.0, 2000.0, 2001.0, 1998.0, 2008.0, 2003.0, 2006.0, 1989.0,
2002.0, 2005.0],
dtype='float64')
other: Power distribution for missing models
0 11
75 2
129 1
99 1
72 1
79 1
80 1
116 1
90 1
95 1
Name: power, dtype: int64
Int64Index([0, 75, 129, 99, 72, 79, 80, 116, 90, 95], dtype='int64')
other: Registration year distribution for missing models
2001.0 3
1999.0 2
1991.0 2
2008.0 1
1997.0 1
1993.0 1
2000.0 1
1983.0 1
1988.0 1
2007.0 1
2006.0 1
1981.0 1
2013.0 1
1992.0 1
1980.0 1
2016.0 1
1971.0 1
Name: registrationyear, dtype: int64
Float64Index([2001.0, 1999.0, 1991.0, 2008.0, 1997.0, 1993.0, 2000.0, 1983.0,
1988.0, 2007.0, 2006.0, 1981.0, 2013.0, 1992.0, 1980.0, 2016.0,
1971.0],
dtype='float64')
# When the year is 1996 there are 149 c_klasse - no other model listed
# When the year is 1994 there are 140 c_klasse and 1 b_klasse (b_klasse was not made in 1994)
# When year is 1995 there are 124 c_klasse and 1 e_klasse - most likely replacement is c_klasse
# When year is 1997 there are 109 c_klasse and 1 b_klasse - again b_klasse not made until 2005
# When year 1998 - 101 c_klasse - no other models listed
# Year 1999 shown below - g_klasse did not have a sedan
display(df[(df['brand'] == 'mercedes_benz') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([143,122,125])) & \
(df['model'].isna())].value_counts(subset = 'registrationyear').index)
display(df[(df['brand'] == 'mercedes_benz') & (df['vehicletype'].isin(['sedan'])) & (df['registrationyear'].isin([1999])) & (df['model'].isna())].value_counts(subset = 'power').index)
display(df[(df['brand'] == 'mercedes_benz') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([143])) & \
(df['registrationyear'].isin([1999])) & (df['model'].notna())].value_counts(subset = 'model'))
display(df[(df['brand'] == 'mercedes_benz') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([122])) & \
(df['registrationyear'].isin([1999])) & (df['model'].notna())].value_counts(subset = 'model'))
# When HP 150: e_klasse - 111, c_klasse - 1, other - 1 (majority is e_klasse)
# When HP 177: e_klasse - 65, other - 4, c_klasse - 2 (majority is e_klasse)
# When HP 306: shown below (majority is e_klasse; only 2 rows in this category are missing a model, so filling them as e_klasse is justified)
display(df[(df['brand'] == 'mercedes_benz') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([306])) & (df['registrationyear'].isin([2002])) & (df['model'].notna())].value_counts(subset = 'model'))
# When HP 102: a_klasse - 70, c_klasse - 6, e_klasse - 1 (majority is a_klasse)
# When HP 82: shown below
display(df[(df['brand'] == 'mercedes_benz') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([82])) & (df['registrationyear'].isin([2000])) & (df['model'].notna())].value_counts(subset = 'model'))
Float64Index([1999.0, 2002.0, 1998.0, 1996.0, 1997.0, 2001.0, 1989.0, 1990.0,
1991.0, 1992.0, 1994.0, 1995.0, 2003.0, 2000.0, 2004.0, 1985.0,
1987.0, 1993.0, 2012.0],
dtype='float64', name='registrationyear')
Int64Index([0, 143, 122, 125, 204, 224, 86, 102, 109, 110, 170, 174, 177, 225], dtype='int64', name='power')
model
e_klasse 24
dtype: int64
model
c_klasse 72
g_klasse 1
dtype: int64
model
e_klasse 41
s_klasse 2
dtype: int64
model
a_klasse 67
dtype: int64
ek = (df['brand'] == 'mercedes_benz') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([143])) & \
(df['registrationyear'].isin([1999])) & (df['model'].isna())
df.loc[ek,['model']] = 'e_klasse'
ck = (df['brand'] == 'mercedes_benz') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([122])) & \
(df['registrationyear'].isin([1996,1994,1995, 1997, 1998, 1999])) & (df['model'].isna())
df.loc[ck,['model']] = 'c_klasse'
e = (df['brand'] == 'mercedes_benz') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([150, 177,306])) & (df['registrationyear'].isin([2002])) & (df['model'].isna())
df.loc[e,['model']] = 'e_klasse'
a = (df['brand'] == 'mercedes_benz') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([102,82])) & (df['registrationyear'].isin([2000])) & (df['model'].isna())
df.loc[a,['model']] = 'a_klasse'
df.isna().sum()
datecrawled 0
price 0
vehicletype 37471
registrationyear 32
gearbox 19830
power 0
model 18340
mileage 0
registrationmonth 0
fueltype 32889
brand 0
notrepaired 71145
datecreated 0
numberofpictures 0
postalcode 0
lastseen 0
registration_correction 19701
dtype: int64
del ek, ck, e, a
gc.collect()
0
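The four manual fills above share one pattern: build a boolean mask on brand, body type, power and registration year, then assign a model only where it is missing. A minimal sketch of that pattern as a helper, assuming the lowercase column names used in this notebook. The name `assign_model` and the toy frame are hypothetical and not part of the original analysis.

```python
import pandas as pd

def assign_model(frame, brand, vehicletype, powers, years, model):
    """Hypothetical helper: fill missing `model` in place for rows matching every filter."""
    mask = (
        frame["brand"].eq(brand)
        & frame["vehicletype"].eq(vehicletype)
        & frame["power"].isin(powers)
        & frame["registrationyear"].isin(years)
        & frame["model"].isna()
    )
    frame.loc[mask, "model"] = model
    return int(mask.sum())  # number of rows that were filled

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "brand": ["mercedes_benz"] * 4,
    "vehicletype": ["sedan", "sedan", "wagon", "sedan"],
    "power": [122, 122, 122, 122],
    "registrationyear": [1996.0, 2005.0, 1996.0, 1996.0],
    "model": [None, None, None, "e_klasse"],
})

n_filled = assign_model(
    toy, "mercedes_benz", "sedan", [122],
    [1994, 1995, 1996, 1997, 1998, 1999], "c_klasse",
)
print(n_filled)
print(toy["model"].tolist())
```

Only the first row matches every filter and has no model, so it is the only one changed; the existing `e_klasse` value is left alone.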
def fill_missing_models_majority(df, threshold=0.9):
    """
    Unified model-filling strategy that combines:
    1. Unique-combination inference
    2. Majority vote with power+registrationyear
    3. Majority vote with year_bin
    4. Tiered fallback grouping strategy
    Only fills when confidence is at or above the given threshold.
    """
    import numpy as np
    import pandas as pd
    df = df.copy()
    # -----------------------------
    # STEP 0: Create year_bin
    # -----------------------------
    def categorize_year(year):
        if pd.isna(year):
            return np.nan
        elif year < 1990:
            return 'before_1990'
        elif year < 2000:
            return '1990s'
        elif year < 2010:
            return '2000s'
        else:
            return '2010_plus'
    df['year_bin'] = df['registrationyear'].apply(categorize_year)
    # -----------------------------
    # STEP 1: Unique combination rule
    # -----------------------------
    known = df[df['model'].notna()]
    missing = df[df['model'].isna()]
    unique_models = (
        known.groupby(['brand', 'vehicletype', 'power', 'registrationyear'])['model']
        .nunique()
        .reset_index(name='model_count')
    )
    unique_keys = unique_models[unique_models['model_count'] == 1].drop(columns='model_count')
    unique_known = known.merge(
        unique_keys,
        on=['brand', 'vehicletype', 'power', 'registrationyear'],
        how='inner'
    )[["brand", "vehicletype", "power", "registrationyear", "model"]].drop_duplicates()
    missing = missing.merge(
        unique_known,
        on=['brand', 'vehicletype', 'power', 'registrationyear'],
        how='left',
        suffixes=('', '_uniq')
    )
    missing['model'] = missing['model_uniq'].combine_first(missing['model'])
    missing.drop(columns=['model_uniq'], inplace=True)
    df = pd.concat([known, missing], ignore_index=True)
    # -----------------------------
    # STEP 2: Majority rule with power + registrationyear
    # -----------------------------
    known = df[df['model'].notna()]
    missing = df[df['model'].isna()]
    model_stats = (
        known.groupby(['brand', 'vehicletype', 'power', 'registrationyear', 'model'])
        .size()
        .groupby(level=[0,1,2,3])
        .apply(lambda x: x / x.sum())
        .reset_index(name='model_share')
    )
    dominant = (
        model_stats[model_stats['model_share'] >= threshold]
        .sort_values('model_share', ascending=False)
        .drop_duplicates(subset=['brand','vehicletype','power','registrationyear'])
    )
    missing = missing.merge(
        dominant[['brand','vehicletype','power','registrationyear','model']],
        on=['brand','vehicletype','power','registrationyear'],
        how='left',
        suffixes=('', '_maj')
    )
    missing['model'] = missing['model_maj'].combine_first(missing['model'])
    missing.drop(columns=['model_maj'], inplace=True)
    df = pd.concat([known, missing], ignore_index=True)
    # -----------------------------
    # STEP 3: Majority rule using year_bin
    # -----------------------------
    known = df[df['model'].notna()]
    missing = df[df['model'].isna()]
    model_counts = (
        known.groupby(['brand','vehicletype','year_bin'])['model']
        .value_counts(normalize=True)
        .rename('freq')
        .reset_index()
    )
    majority_models = model_counts[model_counts['freq'] >= threshold]
    missing = missing.merge(
        majority_models[['brand','vehicletype','year_bin','model']],
        on=['brand','vehicletype','year_bin'],
        how='left',
        suffixes=('', '_bin')
    )
    missing['model'] = missing['model_bin'].combine_first(missing['model'])
    missing.drop(columns=['model_bin'], inplace=True)
    df = pd.concat([known, missing], ignore_index=True)
    # -----------------------------
    # STEP 4: Tiered fallback strategy
    # -----------------------------
    # Note: the count printed below covers only Step 4;
    # fills made in Steps 1-3 are not included in it.
    missing_before = df['model'].isna().sum()
    groupings = [
        ['brand', 'vehicletype'],
        ['brand', 'vehicletype', 'year_bin'],
        ['brand', 'fueltype', 'vehicletype'],
        ['brand', 'vehicletype', 'fueltype', 'year_bin']
    ]
    for cols in groupings:
        def mode_freq_df(x):
            m = x.mode()
            if m.empty:
                return pd.DataFrame({'model_mode':[np.nan],'freq':[0]})
            freq = (x == m[0]).sum() / len(x)
            return pd.DataFrame({'model_mode':[m[0]], 'freq':[freq]})
        majority_df = df.groupby(cols)['model'].apply(mode_freq_df).reset_index()
        majority_df = majority_df[majority_df['freq'] >= threshold]
        lookup = {
            tuple(row[c] for c in cols): row['model_mode']
            for _, row in majority_df.iterrows()
        }
        df['model'] = df.apply(
            lambda row: lookup.get(tuple(row[c] for c in cols), row['model'])
            if pd.isna(row['model']) else row['model'],
            axis=1
        )
    missing_after = df['model'].isna().sum()
    print(f"✅ Filled {missing_before - missing_after} missing models (threshold={threshold:.0%})")
    return df
df_model = fill_missing_models_majority(df)
df_model.isna().sum()
✅ Filled 50 missing models (threshold=90%)
datecrawled 0
price 0
vehicletype 37471
registrationyear 32
gearbox 19830
power 0
model 14555
mileage 0
registrationmonth 0
fueltype 32889
brand 0
notrepaired 71145
datecreated 0
numberofpictures 0
postalcode 0
lastseen 0
registration_correction 19701
year_bin 32
dtype: int64
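The core idea behind the majority steps is "fill a missing value with the group's dominant value, but only if that value's share among known rows reaches the threshold". A minimal, vectorized sketch of that rule on toy data follows. The function name `fill_by_group_majority` and the toy frame are hypothetical; this is an illustration of the technique, not the function used above.

```python
import numpy as np
import pandas as pd

def fill_by_group_majority(frame, target, keys, threshold=0.9):
    """Fill NaN in `target` with the group's dominant value when its share >= threshold."""
    known = frame.dropna(subset=[target])
    # Count each (group, value) pair, then convert counts to within-group shares
    counts = known.groupby(keys + [target]).size().rename("n").reset_index()
    counts["share"] = counts["n"] / counts.groupby(keys)["n"].transform("sum")
    # Keep at most one dominant value per group
    dominant = (
        counts[counts["share"] >= threshold]
        .sort_values("share", ascending=False)
        .drop_duplicates(subset=keys)
    )
    lookup = dominant[keys + [target]].rename(columns={target: "_fill"})
    # Left merge keeps row order and row count because lookup keys are unique
    merged = frame.merge(lookup, on=keys, how="left")
    out = frame.copy()
    out[target] = frame[target].where(frame[target].notna(), merged["_fill"].to_numpy())
    return out

# Toy data (made up): one group with a clear majority, one without
toy = pd.DataFrame({
    "brand": ["vw"] * 10 + ["opel"] * 4,
    "vehicletype": ["sedan"] * 10 + ["small"] * 4,
    "model": ["golf"] * 9 + [np.nan] + ["corsa", "astra", "corsa", np.nan],
})

filled = fill_by_group_majority(toy, "model", ["brand", "vehicletype"])
print(filled["model"].isna().sum())
```

In the toy frame the `vw`/`sedan` group is 100% `golf` among known rows, so its gap is filled; the `opel`/`small` group peaks at about 67% and stays missing.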
def fill_missing_vehicle_type(df, threshold=0.9):
    """
    Fills missing vehicle types based on the most common value within:
    1. (brand, model, year_bin, power)
    2. (brand, model, year_bin)
    3. (brand, model)
    Only fills when the confidence (frequency ratio of the mode)
    is above the given threshold.
    """
    df = df.copy()
    def safe_mode(series):
        m = series.mode(dropna=True)
        return m.iloc[0] if not m.empty else np.nan
    # -----------------------------
    # Helper: compute majority + confidence
    # -----------------------------
    def compute_majority(group_cols, name):
        grouped = df.groupby(group_cols)['vehicletype']
        majority = grouped.apply(safe_mode)
        confidence = grouped.apply(
            lambda x: x.value_counts(normalize=True).iloc[0]
            if not x.dropna().empty else 0
        )
        majority = majority[confidence >= threshold]  # keep only high-confidence groups
        return majority.rename(name).reset_index()
    # -----------------------------
    # STEP 1: (brand, model, year_bin, power)
    # -----------------------------
    cols_lvl1 = ['brand', 'model', 'year_bin', 'power']
    majority_lvl1 = compute_majority(cols_lvl1, 'lvl1_type')
    df = df.merge(majority_lvl1, on=cols_lvl1, how='left')
    # -----------------------------
    # STEP 2: (brand, model, year_bin)
    # -----------------------------
    cols_lvl2 = ['brand', 'model', 'year_bin']
    majority_lvl2 = compute_majority(cols_lvl2, 'lvl2_type')
    df = df.merge(majority_lvl2, on=cols_lvl2, how='left')
    # -----------------------------
    # STEP 3: (brand, model)
    # -----------------------------
    cols_lvl3 = ['brand', 'model']
    majority_lvl3 = compute_majority(cols_lvl3, 'lvl3_type')
    df = df.merge(majority_lvl3, on=cols_lvl3, how='left')
    # -----------------------------
    # STEP 4: Fill progressively
    # -----------------------------
    df['vehicletype'] = (
        df['vehicletype']
        .fillna(df['lvl1_type'])
        .fillna(df['lvl2_type'])
        .fillna(df['lvl3_type'])
    )
    # -----------------------------
    # STEP 5: Cleanup
    # -----------------------------
    df.drop(columns=['lvl1_type', 'lvl2_type', 'lvl3_type'], inplace=True)
    return df
# 14,224 `vehicletype` rows filled
df_vt = fill_missing_vehicle_type(df_model)
df_vt.isna().sum()
datecrawled 0
price 0
vehicletype 23247
registrationyear 32
gearbox 19830
power 0
model 14555
mileage 0
registrationmonth 0
fueltype 32889
brand 0
notrepaired 71145
datecreated 0
numberofpictures 0
postalcode 0
lastseen 0
registration_correction 19701
year_bin 32
dtype: int64
def fill_missing_fueltype(df, group_cols=None, threshold=0.9):
    if group_cols is None:
        group_cols = ['brand', 'model', 'power', 'vehicletype', 'registrationyear']
    df = df.copy()
    # Step 1: Compute mode fueltype per group
    fuel_mode_stats = (
        df[df['fueltype'].notna()]
        .groupby(group_cols)['fueltype']
        .agg(lambda x: x.mode()[0] if not x.mode().empty else None)
        .reset_index(name='mode_fueltype')
    )
    # Step 2: Compute how dominant (confident) that mode is
    fuel_freq_stats = (
        df[df['fueltype'].notna()]
        .groupby(group_cols)['fueltype']
        .value_counts(normalize=True)
        .groupby(level=list(range(len(group_cols))))
        .max()
        .reset_index(name='mode_freq')
    )
    # Step 3: Keep only groups with strong mode agreement
    fuel_stats = pd.merge(fuel_mode_stats, fuel_freq_stats, on=group_cols)
    fuel_stats = fuel_stats[fuel_stats['mode_freq'] >= threshold]
    # Step 4: Merge back and fill missing
    df = df.merge(fuel_stats, on=group_cols, how='left')
    df['fueltype'] = df.apply(
        lambda row: row['mode_fueltype'] if pd.isna(row['fueltype']) and pd.notna(row['mode_fueltype'])
        else row['fueltype'],
        axis=1
    )
    # Step 5: Clean up helper columns
    df = df.drop(columns=['mode_fueltype', 'mode_freq'], errors='ignore')
    return df
# 13,386 `fueltype` values filled
df_ft = fill_missing_fueltype(df_vt)
df_ft.isna().sum()
datecrawled 0
price 0
vehicletype 23247
registrationyear 32
gearbox 19830
power 0
model 14555
mileage 0
registrationmonth 0
fueltype 19503
brand 0
notrepaired 71145
datecreated 0
numberofpictures 0
postalcode 0
lastseen 0
registration_correction 19701
year_bin 32
dtype: int64
def fill_zero_power(df, group_cols=None, threshold=0.9):
    """
    Fill zero horsepower (HP) values using mode-based imputation with a confidence threshold.
    group_cols : list of str, optional
        Columns to group by when determining mode HP.
        Default: ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear']
    returns df :
        DataFrame with zero HP values filled where confident mode exists.
    """
    # Default grouping columns
    if group_cols is None:
        group_cols = ['brand', 'model', 'vehicletype','fueltype', 'registrationyear']
    df = df.copy()  # Work on a copy to avoid side effects
    # Step 1: Compute mode HP for each group
    hp_mode_stats = (
        df[df['power'] > 0]  # Only consider valid HPs
        .groupby(group_cols)['power']
        .agg(lambda x: x.mode()[0] if not x.mode().empty else None)
        .reset_index(name='mode_hp')
    )
    # Step 2: Compute mode frequency (confidence)
    hp_freq_stats = (
        df[df['power'] > 0]
        .groupby(group_cols)['power']
        .value_counts(normalize=True)
        .groupby(level=list(range(len(group_cols))))  # Group again by same keys
        .max()
        .reset_index(name='mode_freq')
    )
    # Step 3: Keep only groups where mode occurs ≥ threshold fraction of the time
    hp_stats = pd.merge(hp_mode_stats, hp_freq_stats, on=group_cols)
    hp_stats = hp_stats[hp_stats['mode_freq'] >= threshold]
    # Step 4: Merge imputation info back to df
    df = df.merge(hp_stats, on=group_cols, how='left')
    # Step 5: Fill zeros only where confident mode exists
    df['power'] = df.apply(
        lambda row: row['mode_hp'] if row['power'] == 0 and pd.notna(row['mode_hp']) else row['power'],
        axis=1
    )
    # Step 6: Clean up helper columns
    df = df.drop(columns=['mode_hp', 'mode_freq'], errors='ignore')
    return df
display(df_ft[df_ft['power'] == 0])
# 1,597 zero-HP values filled (40,218 -> 38,621 rows with power == 0)
df_car = fill_zero_power(df_ft)
df_car[df_car['power'] == 0]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24/03/2016 11:52 | 480 | NaN | 1993.0 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 2016-03-24 | 0 | 70435 | 07/04/2016 03:16 | N | 1990s |
| 14 | 11/03/2016 21:39 | 450 | small | 1910.0 | NaN | 0 | ka | 5000 | 0 | petrol | ford | NaN | 2016-11-03 | 0 | 24148 | 19/03/2016 08:46 | Y: too early | before_1990 |
| 31 | 15/03/2016 20:59 | 245 | sedan | 1994.0 | NaN | 0 | golf | 150000 | 2 | petrol | volkswagen | no | 2016-03-15 | 0 | 44145 | 17/03/2016 18:17 | N | 1990s |
| 36 | 28/03/2016 17:50 | 1500 | bus | 2016.0 | NaN | 0 | kangoo | 150000 | 1 | gasoline | renault | no | 2016-03-28 | 0 | 46483 | 30/03/2016 09:18 | N | 2010_plus |
| 39 | 26/03/2016 22:06 | 0 | small | 1990.0 | NaN | 0 | corsa | 150000 | 1 | petrol | opel | NaN | 2016-03-26 | 0 | 56412 | 27/03/2016 17:43 | N | 1990s |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354098 | 19/03/2016 14:53 | 1750 | NaN | 1995.0 | NaN | 0 | NaN | 100000 | 12 | NaN | sonstige_autos | NaN | 2016-03-19 | 0 | 6188 | 01/04/2016 01:47 | NaN | 1990s |
| 354100 | 30/03/2016 07:54 | 0 | NaN | 2000.0 | NaN | 0 | NaN | 150000 | 0 | NaN | sonstige_autos | NaN | 2016-03-30 | 0 | 6686 | 06/04/2016 23:46 | NaN | 2000s |
| 354101 | 07/03/2016 19:51 | 1500 | NaN | 1995.0 | NaN | 0 | NaN | 150000 | 0 | NaN | volkswagen | NaN | 2016-07-03 | 0 | 26789 | 03/04/2016 11:46 | NaN | 1990s |
| 354104 | 31/03/2016 19:52 | 180 | NaN | 1995.0 | NaN | 0 | NaN | 125000 | 3 | petrol | opel | NaN | 2016-03-31 | 0 | 41470 | 06/04/2016 14:18 | NaN | 1990s |
| 354106 | 14/03/2016 17:48 | 2200 | NaN | 2005.0 | NaN | 0 | NaN | 20000 | 1 | NaN | sonstige_autos | NaN | 2016-03-14 | 0 | 39576 | 06/04/2016 00:46 | NaN | 2000s |
40218 rows × 18 columns
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24/03/2016 11:52 | 480 | NaN | 1993.0 | manual | 0.0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 2016-03-24 | 0 | 70435 | 07/04/2016 03:16 | N | 1990s |
| 14 | 11/03/2016 21:39 | 450 | small | 1910.0 | NaN | 0.0 | ka | 5000 | 0 | petrol | ford | NaN | 2016-11-03 | 0 | 24148 | 19/03/2016 08:46 | Y: too early | before_1990 |
| 31 | 15/03/2016 20:59 | 245 | sedan | 1994.0 | NaN | 0.0 | golf | 150000 | 2 | petrol | volkswagen | no | 2016-03-15 | 0 | 44145 | 17/03/2016 18:17 | N | 1990s |
| 39 | 26/03/2016 22:06 | 0 | small | 1990.0 | NaN | 0.0 | corsa | 150000 | 1 | petrol | opel | NaN | 2016-03-26 | 0 | 56412 | 27/03/2016 17:43 | N | 1990s |
| 53 | 17/03/2016 07:56 | 4700 | wagon | 2005.0 | manual | 0.0 | signum | 150000 | 0 | NaN | opel | no | 2016-03-17 | 0 | 88433 | 04/04/2016 04:17 | N | 2000s |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354098 | 19/03/2016 14:53 | 1750 | NaN | 1995.0 | NaN | 0.0 | NaN | 100000 | 12 | NaN | sonstige_autos | NaN | 2016-03-19 | 0 | 6188 | 01/04/2016 01:47 | NaN | 1990s |
| 354100 | 30/03/2016 07:54 | 0 | NaN | 2000.0 | NaN | 0.0 | NaN | 150000 | 0 | NaN | sonstige_autos | NaN | 2016-03-30 | 0 | 6686 | 06/04/2016 23:46 | NaN | 2000s |
| 354101 | 07/03/2016 19:51 | 1500 | NaN | 1995.0 | NaN | 0.0 | NaN | 150000 | 0 | NaN | volkswagen | NaN | 2016-07-03 | 0 | 26789 | 03/04/2016 11:46 | NaN | 1990s |
| 354104 | 31/03/2016 19:52 | 180 | NaN | 1995.0 | NaN | 0.0 | NaN | 125000 | 3 | petrol | opel | NaN | 2016-03-31 | 0 | 41470 | 06/04/2016 14:18 | NaN | 1990s |
| 354106 | 14/03/2016 17:48 | 2200 | NaN | 2005.0 | NaN | 0.0 | NaN | 20000 | 1 | NaN | sonstige_autos | NaN | 2016-03-14 | 0 | 39576 | 06/04/2016 00:46 | NaN | 2000s |
38621 rows × 18 columns
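In the second table `power` appears as `0.0` instead of `0`, most likely because the left merge leaves `mode_hp` as NaN for unmatched groups and the row-wise `apply` then returns floats. A minimal sketch of a vectorized alternative to the row-wise fill, using `Series.mask` and casting back to an integer dtype. The toy frame is made up, and this is only one possible way to write Step 5, not the code that produced the tables above.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): `mode_hp` stands in for the merged helper column
toy = pd.DataFrame({
    "power": [0, 75, 0, 101],
    "mode_hp": [90.0, 90.0, np.nan, np.nan],
})

# Fill only rows where power is zero AND a confident mode exists
fill_here = toy["power"].eq(0) & toy["mode_hp"].notna()
toy["power"] = toy["power"].mask(fill_here, toy["mode_hp"]).astype("int64")

print(toy["power"].tolist())
```

The cast to `int64` is safe here because `power` itself never contains NaN; only the helper column does. On a frame of this size the vectorized form would also avoid a Python-level call per row.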
def fill_all_missing_values(
    df,
    threshold=0.9,
    verbose=True,
    repeat_until_no_change=True,
    max_loops=5
):
    """
    Runs all fill functions in sequence (and optionally repeats)
    until no more missing values are filled.
    Parameters
    ----------
    df : pd.DataFrame
        The input dataframe.
    threshold : float, optional (default=0.9)
        Confidence threshold for majority-based fills.
    verbose : bool, optional (default=True)
        Print progress updates.
    repeat_until_no_change : bool, optional (default=True)
        If True, keeps looping until no new values are filled.
    max_loops : int, optional (default=5)
        Safety limit for maximum number of full passes.
    Returns
    -------
    df : pd.DataFrame
        The filled dataframe.
    """
    df = df.copy()
    steps = [
        ("Vehicle Type", fill_missing_vehicle_type),
        ("Model", fill_missing_models_majority),
        ("Fuel Type", fill_missing_fueltype),
        ("Power (0 HP)", fill_zero_power)
    ]
    def count_missing(d):
        return (
            d['vehicletype'].isna().sum(),
            d['model'].isna().sum(),
            d['fueltype'].isna().sum(),
            (d['power'] == 0).sum()
        )
    last_missing = count_missing(df)
    loop = 0
    while True:
        loop += 1
        if verbose:
            print(f"\n🔁 Pass {loop} (threshold={threshold:.0%})")
        for name, func in steps:
            if verbose:
                print(f" ▶ Running {name} fill function...")
            try:
                df = func(df, threshold=threshold)
            except TypeError:
                df = func(df)
            except Exception as e:
                print(f" ⚠️ Error in {name}: {e}")
        current_missing = count_missing(df)
        if verbose:
            print(f" Missing counts after pass {loop}:")
            print(f" vehicletype: {current_missing[0]:,}")
            print(f" model: {current_missing[1]:,}")
            print(f" fueltype: {current_missing[2]:,}")
            print(f" power==0: {current_missing[3]:,}")
        # Stop if no more changes
        if not repeat_until_no_change:
            break
        if current_missing == last_missing:
            if verbose:
                print("\n✅ No further fills detected — stopping.")
            break
        if loop >= max_loops:
            if verbose:
                print("\n⚠️ Reached max loop limit, stopping.")
            break
        last_missing = current_missing
    if verbose:
        print("\n🏁 All fill functions completed.\n")
    return df
df_car = fill_all_missing_values(df_car, repeat_until_no_change=True)
🔁 Pass 1 (threshold=90%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 17 missing models (threshold=90%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 1:
vehicletype: 22,779
model: 14,397
fueltype: 19,143
power==0: 38,574
🔁 Pass 2 (threshold=90%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 0 missing models (threshold=90%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 2:
vehicletype: 22,720
model: 14,397
fueltype: 19,112
power==0: 38,565
🔁 Pass 3 (threshold=90%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 0 missing models (threshold=90%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 3:
vehicletype: 22,720
model: 14,397
fueltype: 19,112
power==0: 38,565
✅ No further fills detected — stopping.
🏁 All fill functions completed.
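The log above shows why the fills are repeated: a value filled in one column can unlock a fill in another on the next pass, and the loop stops once the missing counts no longer change. A minimal, pandas-free sketch of that stopping rule follows. `run_until_stable`, `step` and `count_missing` are hypothetical names chosen for illustration, and the dictionary stands in for the dataframe.

```python
def run_until_stable(state, step, measure, max_loops=5):
    """Apply `step` repeatedly until `measure(state)` stops changing or max_loops is hit."""
    last = measure(state)
    for loop in range(1, max_loops + 1):
        state = step(state)
        current = measure(state)
        if current == last:
            return state, loop  # this pass changed nothing, so stop
        last = current
    return state, max_loops

def step(s):
    """Toy fill pass: 'c' can only be derived from 'b', and 'b' only from 'a'."""
    s = dict(s)
    if s["c"] is None and s["b"] is not None:
        s["c"] = s["b"] + 1
    if s["b"] is None and s["a"] is not None:
        s["b"] = s["a"] + 1
    return s

def count_missing(s):
    return sum(v is None for v in s.values())

final, passes = run_until_stable({"a": 1, "b": None, "c": None}, step, count_missing)
print(final, passes)
```

In the toy run, pass 1 fills `b`, pass 2 fills `c`, and pass 3 changes nothing, which mirrors the three passes in the log above.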
def correct_registration_years_x(df, threshold=0.9, proximity=1):
    """
    Corrects registration years flagged as 'too early' or 'too late'.
    Adds ±proximity tolerance when determining majority years.
    """
    df = df.copy()
    # --- Split flagged vs correct ---
    flagged_mask = df['registration_correction'].isin(['Y: too early', 'Y: too late'])
    flagged = df[flagged_mask].copy()
    correct = df[~flagged_mask].copy()
    if flagged.empty:
        return df  # nothing to fix
    # --- Helper: Cluster nearby years (±proximity) ---
    def cluster_years(series, proximity=1):
        if series.empty:
            return np.nan, 0
        years = series.dropna().astype(int)
        if years.empty:
            return np.nan, 0
        clusters = []
        for y in sorted(years.unique()):
            found = False
            for cluster in clusters:
                if abs(cluster['years'][-1] - y) <= proximity:
                    cluster['years'].append(y)
                    cluster['count'] += (years == y).sum()
                    found = True
                    break
            if not found:
                clusters.append({'years': [y], 'count': (years == y).sum()})
        top_cluster = max(clusters, key=lambda c: c['count'])
        cluster_year = int(np.round(np.mean(top_cluster['years'])))
        freq = top_cluster['count'] / len(years)
        return cluster_year, freq
    # --- Compute majority year per group ---
    def get_majority_table(group_cols):
        rows = []
        for name, group in correct.groupby(group_cols):
            year, freq = cluster_years(group['registrationyear'], proximity)
            rows.append((*name, year, freq, group['registrationyear'].min(), group['registrationyear'].max()))
        return pd.DataFrame(rows, columns=group_cols + ['majority_year','mode_freq','min','max'])
    # Start with detailed grouping
    majority_df = get_majority_table(['brand','model','power','vehicletype'])
    flagged = flagged.merge(majority_df, on=['brand','model','power','vehicletype'], how='left')
    # --- Fallback using (brand, model, vehicletype) ---
    missing_mask = flagged['majority_year'].isna()
    if missing_mask.any():
        fallback = get_majority_table(['brand','model','vehicletype'])
        flagged = flagged.merge(
            fallback,
            on=['brand','model','vehicletype'],
            how='left',
            suffixes=('','_fallback')
        )
        # Fill in missing majority fields from fallback where possible
        for col in ['majority_year','mode_freq','min','max']:
            flagged[col] = flagged[col].fillna(flagged[f"{col}_fallback"])
        # Clean up helper columns
        flagged.drop(columns=[c for c in flagged.columns if c.endswith('_fallback')], inplace=True)
    # --- Apply corrections ---
    def fill_year(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold and pd.notna(row['majority_year']):
            return row['majority_year']
        elif row['registration_correction'] == 'Y: too early' and pd.notna(row['min']):
            return row['min']
        elif row['registration_correction'] == 'Y: too late' and pd.notna(row['max']):
            return row['max']
        else:
            return row['registrationyear']
    def fill_flag(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold:
            return 'N'
        else:
            return row['registration_correction']
    flagged['registrationyear'] = flagged.apply(fill_year, axis=1)
    flagged['registration_correction'] = flagged.apply(fill_flag, axis=1)
    # --- Cleanup helper cols ---
    flagged.drop(columns=['majority_year','mode_freq','min','max'], inplace=True)
    # --- Combine back safely ---
    result = pd.concat([correct, flagged], ignore_index=True)
    return result
df_years = correct_registration_years_x(df_car)
df_years[(df_years['registration_correction'] == "Y: too early") | (df_years['registration_correction'] == "Y: too late")]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 350533 | 11/03/2016 21:39 | 450 | small | 1996.0 | NaN | 0.0 | ka | 5000 | 0 | petrol | ford | NaN | 2016-11-03 | 0 | 24148 | 19/03/2016 08:46 | Y: too early | before_1990 |
| 350534 | 03/04/2016 20:44 | 1999 | small | 2016.0 | manual | 110.0 | almera | 150000 | 9 | gasoline | nissan | NaN | 2016-03-04 | 0 | 10997 | 05/04/2016 21:17 | Y: too late | 2010_plus |
| 350535 | 31/03/2016 19:43 | 1200 | NaN | 2016.0 | manual | 75.0 | modus | 150000 | 0 | petrol | renault | yes | 2016-03-31 | 0 | 47546 | 31/03/2016 19:43 | Y: too late | 2010_plus |
| 350537 | 04/04/2016 13:48 | 1490 | sedan | 1993.0 | auto | 136.0 | e_klasse | 150000 | 2 | petrol | mercedes_benz | no | 2016-04-04 | 0 | 13349 | 06/04/2016 15:16 | Y: too early | 1990s |
| 350538 | 08/03/2016 18:51 | 350 | wagon | 1998.0 | manual | 0.0 | 6_reihe | 150000 | 0 | petrol | mazda | yes | 2016-08-03 | 0 | 54655 | 07/04/2016 14:56 | Y: too early | 1990s |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354098 | 07/03/2016 21:59 | 2199 | NaN | 2016.0 | manual | 0.0 | samara | 70000 | 0 | NaN | lada | no | 2016-07-03 | 0 | 1796 | 08/03/2016 13:16 | Y: too late | 2010_plus |
| 354099 | 30/03/2016 08:56 | 3899 | bus | 2002.0 | manual | 200.0 | signum | 150000 | 6 | petrol | opel | no | 2016-03-30 | 0 | 84187 | 30/03/2016 08:56 | Y: too early | 2000s |
| 354100 | 20/03/2016 10:48 | 900 | NaN | 1995.0 | manual | 0.0 | 601 | 5000 | 0 | petrol | trabant | NaN | 2016-03-20 | 0 | 9623 | 02/04/2016 09:46 | Y: too late | 1990s |
| 354102 | 12/03/2016 09:56 | 600 | NaN | 2016.0 | manual | 170.0 | vectra | 150000 | 0 | petrol | opel | yes | 2016-12-03 | 0 | 67475 | 06/04/2016 01:46 | Y: too late | 2010_plus |
| 354103 | 29/03/2016 19:50 | 3000 | sedan | 2016.0 | manual | 0.0 | colt | 150000 | 8 | gasoline | mitsubishi | no | 2016-03-29 | 0 | 45472 | 06/04/2016 05:46 | Y: too late | 2010_plus |
1654 rows × 18 columns
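The `cluster_years` helper called inside `correct_registration_years_x` is only partly visible in this section. The block below is a minimal sketch of how such a helper could work, not the notebook's actual implementation. The chaining rule (a year joins the current cluster when it lies within `proximity` of the previous year) and the name `cluster_years_sketch` are assumptions; only the last three steps (largest cluster, rounded mean, share of all years) mirror the visible code.

```python
import numpy as np
import pandas as pd

def cluster_years_sketch(years, proximity=1):
    """Hypothetical reconstruction: return (representative year, share of rows in the largest cluster)."""
    years = sorted(pd.Series(years).dropna().astype(int))
    if not years:
        return np.nan, 0.0
    # Assumed rule: chain sorted years together while the gap stays within `proximity`
    clusters = [{'years': [years[0]], 'count': 1}]
    for y in years[1:]:
        current = clusters[-1]
        if y - current['years'][-1] <= proximity:
            current['years'].append(y)
            current['count'] += 1
        else:
            clusters.append({'years': [y], 'count': 1})
    # These three steps follow the visible tail of the original helper
    top_cluster = max(clusters, key=lambda c: c['count'])
    cluster_year = int(np.round(np.mean(top_cluster['years'])))
    freq = top_cluster['count'] / len(years)
    return cluster_year, freq

year, freq = cluster_years_sketch([2004, 2005, 2005, 2006, 1975], proximity=1)
```

With this toy input the four years around 2005 form the largest cluster, so a row flagged as implausible would only be overwritten when that cluster's share clears the chosen `threshold`.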
gc.collect()
display(df_years['registration_correction'].value_counts(dropna = False))
df_years1 = correct_registration_years_x(df_years, threshold = 0.9, proximity = 5)
display(df_years1['registration_correction'].value_counts(dropna = False))
df_reg = correct_registration_years_x(df_years1, threshold = 0.9, proximity = 10)
df_reg['registration_correction'].value_counts(dropna = False)
df_reg = correct_registration_years_x(df_reg, threshold = 0.8, proximity = 1)
df_reg['registration_correction'].value_counts(dropna = False)
df_reg = correct_registration_years_x(df_reg, threshold = 0.8, proximity = 5)
df_reg['registration_correction'].value_counts(dropna = False)
df_reg = correct_registration_years_x(df_reg, threshold = 0.8, proximity = 10)
df_reg['registration_correction'].value_counts(dropna = False)
df_reg = correct_registration_years_x(df_reg, threshold = 0.7, proximity = 1)
df_reg['registration_correction'].value_counts(dropna = False)
df_reg = correct_registration_years_x(df_reg, threshold = 0.7, proximity = 5)
df_reg['registration_correction'].value_counts(dropna = False)
df_reg = correct_registration_years_x(df_reg, threshold = 0.7, proximity = 10)
display(df_reg['registration_correction'].value_counts(dropna = False))
df_reg[(df_reg['registration_correction'] == 'Y: too early')].value_counts(subset = 'model').index
N               332752
NaN              19701
Y: too late        888
Y: too early       766
Name: registration_correction, dtype: int64
N               333250
NaN              19701
Y: too late        729
Y: too early       427
Name: registration_correction, dtype: int64
N               333741
NaN              19701
Y: too late        518
Y: too early       147
Name: registration_correction, dtype: int64
Index(['e_klasse', '6_reihe', 'i3', 'golf', 'antara', '3er', '911', 'cx_reihe',
'vivaro', 'v60', 'glk', 'kuga', 'signum', 'insignia',
'range_rover_evoque', 'combo', 'up', 'serie_1', 'tucson', 'a2', '601',
'passat', 'modus', '145', 'picanto', 'q3', 'tigra', 'x_reihe', 'spark',
'fabia', 'kangoo', 'fox', '300c', 'escort', 'cr_reihe', 'cc', 'cayenne',
'c_klasse', 'c2', 'beetle', 'b_klasse', 'astra', 'a_klasse', 'a3',
'500', 'zafira'],
dtype='object', name='model')
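The repeated calls above step through progressively looser `threshold` values, each with `proximity` 1, 5 and 10. As a sketch, the same schedule can be written as a loop. `run_passes` and `fake_correct` are hypothetical names used only for illustration, and the sketch assumes the first call's defaults are `threshold=0.9, proximity=1`, which this section does not show.

```python
from itertools import product

def run_passes(df, correct_fn, thresholds=(0.9, 0.8, 0.7), proximities=(1, 5, 10)):
    """Apply correct_fn once per (threshold, proximity) pair, strictest threshold first."""
    schedule = list(product(thresholds, proximities))
    for threshold, proximity in schedule:
        df = correct_fn(df, threshold=threshold, proximity=proximity)
    return df, schedule

# Stand-in for correct_registration_years_x so the sketch runs on its own
calls = []
def fake_correct(df, threshold, proximity):
    calls.append((threshold, proximity))
    return df

_, schedule = run_passes(object(), fake_correct)
```

Running the strict thresholds first means a looser pass only touches rows that no stricter pass could resolve.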
models_for_median = [
'e_klasse', '6_reihe', 'i3', 'golf', 'antara', '3er', '911', 'cx_reihe',
'vivaro', 'v60', 'glk', 'kuga', 'signum', 'insignia',
'range_rover_evoque', 'combo', 'up', 'serie_1', 'tucson', 'a2', '601',
'passat', 'modus', '145', 'picanto', 'q3', 'tigra', 'x_reihe', 'spark',
'fabia', 'kangoo', 'fox', '300c', 'escort', 'cr_reihe', 'cc', 'cayenne',
'c_klasse', 'c2', 'beetle', 'b_klasse', 'astra', 'a_klasse', 'a3',
'500', 'zafira'
]
# For faster lookups
models_for_median = set(models_for_median)
# Loop through each model
for model in models_for_median:
    # mask for this model & still marked too early
    mask = (
        (df_reg['model'] == model) &
        (df_reg['registration_correction'] == "Y: too early")
    )
    # compute median from the VALID years for this model
    median_year = (
        df_reg.loc[
            (df_reg['model'] == model) &
            df_reg['registrationyear'].between(1885, 2026)
        ]['registrationyear']
        .median()
    )
    # fill with the median
    df_reg.loc[mask, 'registrationyear'] = median_year
    # mark as corrected
    df_reg.loc[mask, 'registration_correction'] = "N"
display(df_reg['registration_correction'].value_counts(dropna = False))
df_reg[(df_reg['registration_correction'] == 'Y: too late')].value_counts(subset = 'model').index
N 333888 NaN 19701 Y: too late 518 Name: registration_correction, dtype: int64
Index(['vectra', 'escort', '147', 'cordoba', 'colt', 'lanos', '100', 'primera',
'80', 'kadett', 'modus', '90', 'a2', 'omega', 'ptcruiser', 'stilo',
'move', 'tigra', 'other', 'roomster', 'bora', 'bravo', 'croma', 'r19',
'145', 'altea', '9000', 'samara', 'santa', 'musa', 'galant', '900',
'kalina', '601', 'elefantino', 'juke', '911', 'almera', 'aveo',
'toledo', 'serie_2', 'lybra', 'calibra', 'crossfire', '159', 'delta',
'getz', 'rangerover'],
dtype='object', name='model')
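The per-model loop above can also be expressed without an explicit loop or a hand-copied model list. The following is a minimal sketch on toy data, not a drop-in replacement: the frame is invented, and the valid range 1885 to 2016 is an assumption (the loop above uses 2026 as the upper bound).

```python
import pandas as pd

# Toy data: one implausible year per model, flagged as "Y: too early"
df = pd.DataFrame({
    'model': ['golf', 'golf', 'golf', 'golf', 'astra', 'astra'],
    'registrationyear': [2000.0, 2004.0, 2008.0, 1500.0, 1999.0, 1000.0],
    'registration_correction': ['N', 'N', 'N', 'Y: too early', 'N', 'Y: too early'],
})

# Median per model, computed from plausible years only
valid = df['registrationyear'].between(1885, 2016)
medians = df.loc[valid].groupby('model')['registrationyear'].median()

# Fill only rows that are still flagged and whose model has a usable median
flagged = df['registration_correction'].eq('Y: too early') & df['model'].isin(medians.index)
df.loc[flagged, 'registrationyear'] = df.loc[flagged, 'model'].map(medians)
df.loc[flagged, 'registration_correction'] = 'N'
```

Because the medians are computed once with `groupby`, the frame is scanned a fixed number of times instead of twice per model.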
models_for_median = [
'vectra', 'escort', '147', 'cordoba', 'colt', 'lanos', '100', 'primera',
'80', 'kadett', 'modus', '90', 'a2', 'omega', 'ptcruiser', 'stilo',
'move', 'tigra', 'other', 'roomster', 'bora', 'bravo', 'croma', 'r19',
'145', 'altea', '9000', 'samara', 'santa', 'musa', 'galant', '900',
'kalina', '601', 'elefantino', 'juke', '911', 'almera', 'aveo',
'toledo', 'serie_2', 'lybra', 'calibra', 'crossfire', '159', 'delta',
'getz', 'rangerover'
]
# For faster lookups
models_for_median = set(models_for_median)
# Loop through each model
for model in models_for_median:
    # mask for this model & still marked too late
    mask = (
        (df_reg['model'] == model) &
        (df_reg['registration_correction'] == "Y: too late")
    )
    # compute median from the VALID years for this model
    median_year = (
        df_reg.loc[
            (df_reg['model'] == model) &
            df_reg['registrationyear'].between(1885, 2026)
        ]['registrationyear']
        .median()
    )
    # fill with the median
    df_reg.loc[mask, 'registrationyear'] = median_year
    # mark as corrected
    df_reg.loc[mask, 'registration_correction'] = "N"
display(df_reg['registration_correction'].value_counts(dropna = False))
display(df_reg[(df_reg['registration_correction'].isna())].value_counts(subset = 'model').index.to_list())
N 334406 NaN 19701 Name: registration_correction, dtype: int64
['other', '3er', 'corsa', 'golf', 'astra', 'polo', 'passat', 'fortwo', '2_reihe', 'transporter', '3_reihe', 'a4', 'c_klasse', 'punto', 'a3', 'a_klasse', 'fiesta', 'focus', '5er', 'e_klasse', '80', 'zafira', 'micra', 'a6', '6_reihe', 'civic', 'megane', 'mondeo', 'colt', 'i_reihe', '601', 'vectra', 'fabia', 'clio', 'x_reihe', 'ibiza', 'laguna', '1_reihe', 'clk', 'cooper', '1er', '156', 'lupo', 'corolla', 'escort', 'rio', 'cordoba', 'scenic', 'caddy', 'arosa', 'twingo', 'octavia', 'galaxy', 'v40', 'ypsilon', 'a5', 'tt', 'jazz', 'sorento', 'swift', 'primera', 'tigra', 'm_klasse', 'carisma', 'stilo', 'kangoo', 's_klasse', 'omega', '100', 'a8', 'beetle', 'sharan', 'v70', 'getz', 'logan', 'one', 'lancer', '7er', 'seicento', 'cr_reihe', 'matiz', 'leon', 'tiguan', 'touran', 'sprinter', 'mx_reihe', 'jimny', 'touareg', 'freelander', 'espace', 'cuore', 'voyager', 'yaris', 'superb', 'justy', '850', 'berlingo', 'santa', '147', 'scirocco', 'niva', 'toledo', 'panda', 'picanto', 'signum', 'kuga', True, 'aygo', '5_reihe', 'bravo', 'alhambra', 'c4', 'galant', '6er', 'cayenne', '500', 'c5', 'kadett', 'avensis', 'ducato', 'carnival', 'ptcruiser', 'phaeton', 'c3', 'doblo', 'calibra', 'fox', 'lanos', '4_reihe', 'altea', 'rav', 'transit', 'roomster', 'z_reihe', 'vivaro', 'sportage', 'accord', 'almera', 'sl', 'slk', 'b_klasse', 'auris', 's_type', 'grand', 'nubira', 'duster', 'ka', 'qashqai', 'insignia', 'captiva', 'ceed', 'cl', 'forfour', 'combo', 'lybra', 'jetta', 'r19', 'meriva', 'terios', '300c', 'kalos', 'pajero', '159', 'c1', 'cherokee', 'sandero', 'v50', 'x_type', 'impreza', 'xc_reihe', 'legacy', '145', 'vito', 'lodgy', 'rangerover', '90', '900', '911', 'a1', 'a2', 'm_reihe', 'forester', 'eos', 'clubman', 'sirion', 'c2', 'c_reihe', 'fusion']
models_for_median = [ '3er','corsa','golf','astra','polo','passat','fortwo','2_reihe','transporter','a4','c_klasse',
'punto','a3','a_klasse','fiesta','focus','5er','e_klasse','80','zafira','micra','a6','6_reihe','civic','megane','mondeo',
'colt','i_reihe','601','vectra','fabia','clio','x_reihe','ibiza','laguna','clk','cooper','1er','156','lupo','corolla',
'escort','rio','cordoba','scenic','caddy','arosa','twingo','octavia','galaxy','v40','ypsilon','a5','tt','jazz','sorento',
'swift','primera','tigra','m_klasse','carisma','stilo','kangoo','s_klasse','omega','100','a8','beetle','sharan','v70',
'getz','logan','one','lancer','7er','seicento','cr_reihe','matiz','leon','tiguan','touran','sprinter','mx_reihe','jimny','touareg',
'freelander','espace','cuore','voyager','yaris','superb','justy','850','berlingo','santa','147','scirocco','niva',
'toledo','panda','picanto','signum','kuga','aygo','bravo','alhambra','c4','galant','6er','cayenne','500','c5','kadett',
'avensis','ducato','carnival','ptcruiser','phaeton','c3','doblo','calibra','fox','lanos','4_reihe','altea','rav','transit','roomster','z_reihe','vivaro',
'sportage','accord','almera','sl','slk','b_klasse','auris','s_type','grand','nubira','duster','ka','qashqai','insignia','captiva',
'ceed','cl','forfour','combo','lybra','jetta','r19','meriva','terios','300c','kalos','pajero','159','c1','cherokee','sandero',
'v50','x_type','impreza','xc_reihe','legacy','145','vito','lodgy','rangerover','90','900','911','a1','a2','m_reihe','forester',
'eos','clubman','sirion','c2','c_reihe','fusion']
# For faster lookups
models_for_median = set(models_for_median)
# Loop through each model
for model in models_for_median:
    # mask for this model & still NaN
    mask = (
        (df_reg['model'] == model) &
        (df_reg['registration_correction'].isna())
    )
    # compute median from the VALID years for this model
    median_year = (
        df_reg.loc[
            (df_reg['model'] == model) &
            df_reg['registrationyear'].between(1885, 2016)
        ]['registrationyear']
        .median()
    )
    # fill with the median
    df_reg.loc[mask, 'registrationyear'] = median_year
    # mark as corrected
    df_reg.loc[mask, 'registration_correction'] = "N"
display(df_reg['registration_correction'].value_counts(dropna = False))
display(df_reg[df_reg['registration_correction'].isna()].value_counts(subset = 'model'))
display(df_reg[(df_reg['model'] == True)])
# Fill in rows where model == True (all are Skoda wagons)
# HP 102: octavia - 150, True - 3 (invalid model), fabia - 3 [majority is octavia]
# HP 150: octavia - 88, True - 1 (invalid model) [majority is octavia]
# HP 105: see below; about 70% are octavia. Not overwhelming, but only one True-labeled row has 105 HP, so filling in octavia is justifiable.
# HP 105 with registration year == 2008: octavia - 30, fabia - 25, roomster - 1 [octavia is still the majority, though not by a lot]; still justifiable since only one fill is required for this HP
display(df_reg[(df_reg['brand'] == 'skoda') & (df_reg['power'].isin([105])) & (df_reg['vehicletype'] == 'wagon')].value_counts(subset = 'model'))
N 338696 NaN 15411 Name: registration_correction, dtype: int64
model other 872 3_reihe 109 1_reihe 23 True 5 5_reihe 5 dtype: int64
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21785 | 31/03/2016 18:55 | 2100 | wagon | 2001.0 | manual | 102.0 | True | 150000 | 6 | petrol | skoda | no | 2016-03-31 | 0 | 19075 | 06/04/2016 13:15 | NaN | 2000s |
| 44767 | 25/03/2016 13:37 | 2500 | wagon | 2001.0 | auto | 150.0 | True | 150000 | 8 | petrol | skoda | no | 2016-03-25 | 0 | 6366 | 06/04/2016 15:44 | NaN | 2000s |
| 57342 | 27/03/2016 15:42 | 4950 | wagon | 2008.0 | manual | 105.0 | True | 150000 | 1 | gasoline | skoda | no | 2016-03-27 | 0 | 51580 | 03/04/2016 11:18 | NaN | 2000s |
| 170162 | 05/03/2016 14:26 | 3500 | wagon | 2007.0 | manual | 102.0 | True | 150000 | 2 | NaN | skoda | NaN | 2016-05-03 | 0 | 60313 | 05/03/2016 16:48 | NaN | 2000s |
| 190632 | 04/04/2016 00:06 | 5700 | wagon | 2005.0 | manual | 102.0 | True | 150000 | 11 | petrol | skoda | no | 2016-03-04 | 0 | 56203 | 05/04/2016 12:12 | NaN | 2000s |
model octavia 263 fabia 81 other 16 superb 8 roomster 7 True 1 dtype: int64
octavia = df_reg['model'] == True
df_reg.loc[octavia,['model']] = 'octavia'
display(df_reg[df_reg['registration_correction'].isna()].value_counts(subset = 'model'))
model other 872 3_reihe 109 1_reihe 23 5_reihe 5 octavia 5 dtype: int64
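The manual counts in the comments above (how many Skoda wagons with a given HP are `octavia`) can be checked with a small helper. This is a sketch on invented toy rows; `majority_model` and the `min_share=0.7` cutoff are hypothetical and only echo the roughly 70% share used to justify the fill.

```python
import pandas as pd

def majority_model(df, brand, vehicletype, power, min_share=0.7):
    """Return (most common valid model, its share) for one brand/vehicletype/power group."""
    group = df[(df['brand'] == brand) & (df['vehicletype'] == vehicletype) & (df['power'] == power)]
    # Ignore invalid labels such as the boolean True found in the model column
    valid = group[group['model'].map(lambda v: isinstance(v, str))]
    shares = valid['model'].value_counts(normalize=True)
    if shares.empty or shares.iloc[0] < min_share:
        return None, 0.0
    return shares.index[0], float(shares.iloc[0])

# Toy rows: four octavia, one fabia, one invalid label
toy = pd.DataFrame({
    'brand': ['skoda'] * 6,
    'vehicletype': ['wagon'] * 6,
    'power': [102] * 6,
    'model': ['octavia', 'octavia', 'octavia', 'octavia', 'fabia', True],
})
model, share = majority_model(toy, 'skoda', 'wagon', 102)
```

Returning the share alongside the label keeps the evidence for each fill visible, which matters when the majority is as narrow as in the 105 HP case.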
# Pairs obtained with queries along the lines of: df_reg[(df_reg['model'] == 'other') & (df_reg['registration_correction'].isna())].value_counts(subset = 'brand')
brand_model_pairs = [
('skoda', 'octavia'),
('mazda', '5_reihe'),
('peugeot', '5_reihe'),
('mazda', '1_reihe'),
('peugeot', '1_reihe'),
('mazda', '3_reihe'),
('peugeot', '3_reihe'),
# all the 'other' models:
('hyundai', 'other'),
('ford', 'other'),
('citroen', 'other'),
('chevrolet', 'other'),
('suzuki', 'other'),
('rover', 'other'),
('opel', 'other'),
('mazda', 'other'),
('mercedes_benz', 'other'),
('nissan', 'other'),
('peugeot', 'other'),
('fiat', 'other'),
('mitsubishi', 'other'),
('toyota', 'other'),
('chrysler', 'other'),
('kia', 'other'),
('volvo', 'other'),
('skoda', 'other'),
('alfa_romeo', 'other'),
('renault', 'other'),
('honda', 'other'),
('saab', 'other'),
('porsche', 'other'),
('audi', 'other'),
('volkswagen', 'other'),
('daewoo', 'other'),
('lada', 'other'),
('jeep', 'other'),
('jaguar', 'other'),
('daihatsu', 'other'),
('subaru', 'other'),
('seat', 'other'),
('bmw', 'other'),
('trabant', 'other'),
('smart', 'other'),
('mini', 'other')
]
for brand, model in brand_model_pairs:
    # mask for valid rows for this brand+model
    valid_mask = (
        (df_reg['brand'] == brand) &
        (df_reg['model'] == model) &
        df_reg['registrationyear'].between(1885, 2016)
    )
    # compute median ONLY for this brand+model
    median_year = df_reg.loc[valid_mask, 'registrationyear'].median()
    # rows we want to fix: correction is NaN
    fix_mask = (
        (df_reg['brand'] == brand) &
        (df_reg['model'] == model) &
        (df_reg['registration_correction'].isna())
    )
    # apply the median
    df_reg.loc[fix_mask, 'registrationyear'] = median_year
    # mark correction as N (corrected)
    df_reg.loc[fix_mask, 'registration_correction'] = 'N'
display(df_reg['registration_correction'].value_counts(dropna = False))
display(df_reg[df_reg['registration_correction'].isna()].value_counts(subset = 'brand').index.to_list())
N 339710 NaN 14397 Name: registration_correction, dtype: int64
['sonstige_autos', 'volkswagen', 'bmw', 'opel', 'audi', 'mercedes_benz', 'ford', 'peugeot', 'renault', 'fiat', 'mazda', 'citroen', 'seat', 'smart', 'nissan', 'alfa_romeo', 'hyundai', 'honda', 'toyota', 'suzuki', 'mitsubishi', 'skoda', 'trabant', 'volvo', 'chrysler', 'kia', 'chevrolet', 'mini', 'subaru', 'daewoo', 'porsche', 'rover', 'daihatsu', 'jeep', 'saab', 'lancia', 'dacia', 'lada', 'land_rover', 'jaguar']
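The hard-coded `brand_model_pairs` list used above could also be derived from the data, which avoids copying names by hand. A minimal sketch on an invented frame (the column names match the notebook, the rows do not come from it):

```python
import pandas as pd

toy = pd.DataFrame({
    'brand': ['mazda', 'peugeot', 'mazda', 'ford', 'ford'],
    'model': ['3_reihe', '3_reihe', '3_reihe', 'other', 'focus'],
    'registration_correction': [None, None, 'N', None, 'N'],
})

# Rows whose registration year has not been resolved yet
pending = toy[toy['registration_correction'].isna()]

# Unique (brand, model) combinations among those rows, as plain tuples
pairs = list(pending[['brand', 'model']].drop_duplicates().itertuples(index=False, name=None))
```

Ambiguous model names such as `3_reihe`, which exist under more than one brand, show up as separate pairs automatically.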
brands_to_fill = [
'volkswagen','bmw','opel','audi','mercedes_benz','ford','renault','peugeot',
'fiat','mazda','citroen','seat','smart','hyundai','alfa_romeo','nissan',
'toyota','honda','trabant','mitsubishi','suzuki','skoda','volvo','chevrolet',
'kia','mini','daihatsu','subaru','daewoo','chrysler','rover','porsche','jeep',
'dacia','lancia','jaguar','land_rover','lada','saab'
]
# Compute brand-level medians (using only reasonable registration years)
reasonable_mask = df_reg['registrationyear'].between(1885, 2016)
brand_medians = df_reg.loc[reasonable_mask].groupby('brand')['registrationyear'].median()
for brand in brands_to_fill:
    median_year = brand_medians.get(brand)
    if pd.isna(median_year):
        continue
    fix_mask = (
        (df_reg['brand'] == brand) &
        (df_reg['registration_correction'].isna())
    )
    df_reg.loc[fix_mask, 'registrationyear'] = median_year
    df_reg.loc[fix_mask, 'registration_correction'] = 'N'
display(df_reg['registration_correction'].value_counts(dropna = False))
display(df_reg[df_reg['registration_correction'].isna()].value_counts(subset = 'brand'))
N 350734 NaN 3373 Name: registration_correction, dtype: int64
brand sonstige_autos 3373 dtype: int64
display(df_reg.info())
df_reg = df_reg[~df_reg['registration_correction'].isna()].copy()
display(df_reg.info())
del df_years
del df_years1
gc.collect()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 354107 entries, 0 to 354106 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 datecrawled 354107 non-null object 1 price 354107 non-null int64 2 vehicletype 331387 non-null object 3 registrationyear 354107 non-null float64 4 gearbox 334277 non-null object 5 power 354107 non-null float64 6 model 339710 non-null object 7 mileage 354107 non-null int64 8 registrationmonth 354107 non-null int64 9 fueltype 334995 non-null object 10 brand 354107 non-null object 11 notrepaired 282962 non-null object 12 datecreated 354107 non-null datetime64[ns] 13 numberofpictures 354107 non-null int64 14 postalcode 354107 non-null int64 15 lastseen 354107 non-null object 16 registration_correction 350734 non-null object 17 year_bin 354075 non-null object dtypes: datetime64[ns](1), float64(2), int64(5), object(10) memory usage: 48.6+ MB
None
<class 'pandas.core.frame.DataFrame'> Int64Index: 350734 entries, 0 to 354106 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 datecrawled 350734 non-null object 1 price 350734 non-null int64 2 vehicletype 329055 non-null object 3 registrationyear 350734 non-null float64 4 gearbox 332007 non-null object 5 power 350734 non-null float64 6 model 339710 non-null object 7 mileage 350734 non-null int64 8 registrationmonth 350734 non-null int64 9 fueltype 332726 non-null object 10 brand 350734 non-null object 11 notrepaired 280912 non-null object 12 datecreated 350734 non-null datetime64[ns] 13 numberofpictures 350734 non-null int64 14 postalcode 350734 non-null int64 15 lastseen 350734 non-null object 16 registration_correction 350734 non-null object 17 year_bin 350702 non-null object dtypes: datetime64[ns](1), float64(2), int64(5), object(10) memory usage: 50.8+ MB
None
0
fix80 = (df_reg['registrationyear'] < 1990) & (df_reg['year_bin'] != 'before_1990')
df_reg.loc[fix80,['year_bin']] = 'before_1990'
fix90 = (df_reg['registrationyear'] > 1989) & (df_reg['registrationyear'] < 2000) & (df_reg['year_bin'] != '1990s')
df_reg.loc[fix90,['year_bin']] = '1990s'
fix00 = (df_reg['registrationyear'] > 1999) & (df_reg['registrationyear'] < 2010) & (df_reg['year_bin'] != '2000s')
df_reg.loc[fix00,['year_bin']] = '2000s'
fix10 = (df_reg['registrationyear'] > 2009) & (df_reg['year_bin'] != '2010_plus')
df_reg.loc[fix10,['year_bin']] = '2010_plus'
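The four mask-and-assign steps above re-derive `year_bin` from the corrected registration year. As an alternative sketch, `pd.cut` can assign all four labels in one call. The bin edges below are chosen to reproduce the same boundaries (1989/1990, 1999/2000, 2009/2010), and the toy `years` series is invented.

```python
import numpy as np
import pandas as pd

years = pd.Series([1985.0, 1989.0, 1990.0, 1999.0, 2000.0, 2009.0, 2010.0, 2016.0])

# Right-closed intervals: (-inf, 1989], (1989, 1999], (1999, 2009], (2009, inf)
year_bin = pd.cut(
    years,
    bins=[-np.inf, 1989, 1999, 2009, np.inf],
    labels=['before_1990', '1990s', '2000s', '2010_plus'],
).astype(object)
```

Casting to `object` keeps the column comparable with the existing string labels; leaving it categorical would also work for the tree-based models imported at the top of the notebook.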
df_app = fill_all_missing_values(df_reg, threshold = 0.9, repeat_until_no_change=True)
🔁 Pass 1 (threshold=90%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 1 missing models (threshold=90%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 1:
vehicletype: 21,555
model: 9,755
fueltype: 17,545
power==0: 37,068
🔁 Pass 2 (threshold=90%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 0 missing models (threshold=90%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 2:
vehicletype: 21,270
model: 9,755
fueltype: 17,495
power==0: 37,068
🔁 Pass 3 (threshold=90%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 0 missing models (threshold=90%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 3:
vehicletype: 21,270
model: 9,755
fueltype: 17,495
power==0: 37,068
✅ No further fills detected — stopping.
🏁 All fill functions completed.
df_app3 = fill_all_missing_values(df_app, threshold = 0.8)
df_app3.isna().sum()
🔁 Pass 1 (threshold=80%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 114 missing models (threshold=80%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 1:
vehicletype: 18,859
model: 8,989
fueltype: 15,358
power==0: 35,995
🔁 Pass 2 (threshold=80%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 0 missing models (threshold=80%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 2:
vehicletype: 18,697
model: 8,921
fueltype: 15,240
power==0: 35,982
🔁 Pass 3 (threshold=80%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 0 missing models (threshold=80%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 3:
vehicletype: 18,697
model: 8,921
fueltype: 15,238
power==0: 35,982
🔁 Pass 4 (threshold=80%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 0 missing models (threshold=80%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 4:
vehicletype: 18,697
model: 8,921
fueltype: 15,238
power==0: 35,982
✅ No further fills detected — stopping.
🏁 All fill functions completed.
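The logs above show why `fill_all_missing_values` is repeated: a value filled late in one pass (for example a model) can unlock another fill (for example a vehicle type) in the next pass, and the loop stops once a pass changes nothing. The sketch below illustrates that fixed-point pattern only. `repeat_until_stable`, `toy_fill_pass` and the lookup dictionaries are hypothetical and are not the notebook's functions.

```python
import pandas as pd

def repeat_until_stable(df, fill_pass, max_passes=10):
    """Run fill_pass until the total number of missing values stops changing."""
    previous = int(df.isna().sum().sum())
    for n in range(1, max_passes + 1):
        df = fill_pass(df)
        current = int(df.isna().sum().sum())
        if current == previous:
            return df, n  # this pass filled nothing, so stop
        previous = current
    return df, max_passes

def toy_fill_pass(df):
    """Toy pass: vehicle type is filled from model first, model from brand second."""
    df = df.copy()
    type_by_model = {'golf': 'sedan'}        # illustrative lookup only
    model_by_brand = {'volkswagen': 'golf'}  # illustrative lookup only
    df['vehicletype'] = df['vehicletype'].fillna(df['model'].map(type_by_model))
    df['model'] = df['model'].fillna(df['brand'].map(model_by_brand))
    return df

toy = pd.DataFrame({
    'brand': ['volkswagen', 'volkswagen'],
    'model': ['golf', None],
    'vehicletype': ['sedan', None],
})
filled, passes = repeat_until_stable(toy, toy_fill_pass)
```

In the toy data the first pass fills the model, the second fills the vehicle type that depends on it, and the third confirms that nothing is left to fill, which matches the shape of the logs above.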
datecrawled                    0
price                          0
vehicletype                18697
registrationyear               0
gearbox                    18727
power                          0
model                       8921
mileage                        0
registrationmonth              0
fueltype                   15238
brand                          0
notrepaired                69822
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
year_bin                       0
dtype: int64
del df_app
# HP too high
too_high_hp = (df_app3['power'] > 999)
df_app3.loc[too_high_hp,['power']] = 0
hp_toohigh = (df_app3['power'] > 621) & (df_app3['model'] != 'other') & (df_app3['model'] != '5er')
df_app3.loc[hp_toohigh,['power']] = 0
hp_high = (df_app3['power'] > 450) & (~(df_app3['brand'].isin(['mercedes_benz','audi','bmw','porsche','ford']))) & (df_app3['model'] != 'other')
df_app3.loc[hp_high,['power']] = 0
vwgolfhigh = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'golf') & (df_app3['power'] > 306)
df_app3.loc[vwgolfhigh,['power']] = 0
polohighe = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'polo') & (df_app3['power'] > 200)
df_app3.loc[polohighe,['power']] = 0
passathigh = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'passat') & (df_app3['power'] > 300)
df_app3.loc[passathigh,['power']] = 0
jagxhigh = (df_app3['brand'] == 'jaguar') & (df_app3['model'] == 'x_type') & (df_app3['power'] > 240)
df_app3.loc[jagxhigh,['power']] = 0
captivahigh = (df_app3['brand'] == 'chevrolet') & (df_app3['model'] == 'captiva') & (df_app3['power'] > 258)
df_app3.loc[captivahigh,['power']] = 0
vwhigh = (df_app3['brand'] == 'volkswagen') & (df_app3['power'] > 420)
df_app3.loc[vwhigh,['power']] = 0
citroenhigh = (df_app3['brand'] == 'citroen') & (df_app3['power'] > 241)
df_app3.loc[citroenhigh,['power']] = 0
chryslerhigh = (df_app3['brand'] == 'chrysler') & (df_app3['power'] > 470)
df_app3.loc[chryslerhigh,['power']] = 0
fiathigh = (df_app3['brand'] == 'fiat') & (df_app3['power'] > 220)
df_app3.loc[fiathigh,['power']] = 0
suzukihigh = (df_app3['brand'] == 'suzuki') & (df_app3['power'] > 290)
df_app3.loc[suzukihigh,['power']] = 0
arhigh = (df_app3['brand'] == 'alfa_romeo') & (df_app3['power'] > 505)
df_app3.loc[arhigh,['power']] = 0
fordhigh = (df_app3['brand'] == 'ford') & (df_app3['power'] > 760)
df_app3.loc[fordhigh,['power']] = 0
chevyhigh = (df_app3['brand'] == 'chevrolet') & (df_app3['power'] > 650)
df_app3.loc[chevyhigh,['power']] = 0
hyundaihigh = (df_app3['brand'] == 'hyundai') & (df_app3['power'] > 370)
df_app3.loc[hyundaihigh,['power']] = 0
mitsubishihigh = (df_app3['brand'] == 'mitsubishi') & (df_app3['power'] > 440)
df_app3.loc[mitsubishihigh,['power']] = 0
nissanhigh = (df_app3['brand'] == 'nissan') & (df_app3['power'] > 600)
df_app3.loc[nissanhigh,['power']] = 0
opelhigh = (df_app3['brand'] == 'opel') & (df_app3['power'] > 577)
df_app3.loc[opelhigh,['power']] = 0
pehigh = (df_app3['brand'] == 'peugeot') & (df_app3['power'] > 360)
df_app3.loc[pehigh,['power']] = 0
seathigh = (df_app3['brand'] == 'seat') & (df_app3['power'] > 340)
df_app3.loc[seathigh,['power']] = 0
volvohigh = (df_app3['brand'] == 'volvo') & (df_app3['power'] > 510)
df_app3.loc[volvohigh,['power']] = 0
smarthigh = (df_app3['brand'] == 'smart') & (df_app3['power'] > 422)
df_app3.loc[smarthigh,['power']] = 0
# HP too low
too_low = (df_app3['power']>0) & (df_app3['power']<5)
df_app3.loc[too_low,['power']] = 0
bmw = (df_app3['brand'] == 'bmw') & (df_app3['model'] == 'bmw')
df_app3.loc[bmw,['model']] = None
opellow = (df_app3['brand'] == 'opel') & (df_app3['power'] > 0) & (df_app3['power']<40) & (df_app3['model'] != 'other')
df_app3.loc[opellow,['power']] = 0
vwlow = (df_app3['brand'] == 'volkswagen') & (df_app3['power']>0) & (df_app3['power']<30) & (df_app3['model'] != 'other')
df_app3.loc[vwlow,['power']] = 0
citroenlow = (df_app3['brand'] == 'citroen') & (df_app3['power']>0) & (df_app3['power'] < 32) & (df_app3['model'] != 'other')
df_app3.loc[citroenlow,['power']] = 0
fordlow = (df_app3['brand'] == 'ford') & (df_app3['power']>0) & (df_app3['power'] < 45) & (df_app3['model'] != 'other')
df_app3.loc[fordlow,['power']] = 0
renaultlow = (df_app3['brand'] == 'renault') & (df_app3['power']>0) & (df_app3['power'] < 32) & (df_app3['model'] != 'other')
df_app3.loc[renaultlow,['power']] = 0
merclow = (df_app3['brand'] == 'mercedes_benz') & (df_app3['power']>0) & (df_app3['power'] < 55) & (df_app3['model'] != 'other')
df_app3.loc[merclow,['power']] = 0
bmwlow = (df_app3['brand'] == 'bmw') & (df_app3['power']>0) & (df_app3['power'] < 55) & (df_app3['model'] != 'other')
df_app3.loc[bmwlow,['power']] = 0
audilow = (df_app3['brand'] == 'audi') & (df_app3['power']>0) & (df_app3['power'] < 44) & (df_app3['model'] != 'other')
df_app3.loc[audilow,['power']] = 0
fiatlow = (df_app3['brand'] == 'fiat') & (df_app3['power']>0) & (df_app3['power'] < 13) & (df_app3['model'] != 'other')
df_app3.loc[fiatlow,['power']] = 0
pelow = (df_app3['brand'] == 'peugeot') & (df_app3['power']>0) & (df_app3['power'] < 34) & (df_app3['model'] != 'other')
df_app3.loc[pelow,['power']] = 0
trabantlow = (df_app3['brand'] == 'trabant') & (df_app3['power']>0) & (df_app3['power'] < 23) & (df_app3['model'] != 'other')
df_app3.loc[trabantlow,['power']] = 0
nislow = (df_app3['brand'] == 'nissan') & (df_app3['power']>0) & (df_app3['power'] < 45) & (df_app3['model'] != 'other')
df_app3.loc[nislow,['power']] = 0
sk45 = (df_app3['brand'].isin(['mazda','smart','seat','skoda','mitsubishi','toyota','volvo','honda','suzuki'])) & (df_app3['power']>0) & (df_app3['power'] < 45) & (df_app3['model'] != 'other')
df_app3.loc[sk45,['power']] = 0
hylow = (df_app3['brand'].isin(['hyundai'])) & (df_app3['power']>0) & (df_app3['power'] < 49) & (df_app3['model'] != 'other')
df_app3.loc[hylow,['power']] = 0
subarulow = (df_app3['brand'].isin(['subaru'])) & (df_app3['power']>0) & (df_app3['power'] < 54) & (df_app3['model'] != 'other')
df_app3.loc[subarulow,['power']] = 0
dacialow = (df_app3['brand'].isin(['dacia'])) & (df_app3['power']>0) & (df_app3['power'] < 67) & (df_app3['model'] != 'other')
df_app3.loc[dacialow,['power']] = 0
k55 = (df_app3['brand'].isin(['rover','kia','lancia'])) & (df_app3['power']>0) & (df_app3['power'] < 55) & (df_app3['model'] != 'other')
df_app3.loc[k55,['power']] = 0
lrlow = (df_app3['brand'].isin(['land_rover'])) & (df_app3['power']>0) & (df_app3['power'] < 50) & (df_app3['model'] != 'other')
df_app3.loc[lrlow,['power']] = 0
fiat500low = (df_app3['brand'] == 'fiat') & (df_app3['model'] == '500') & (df_app3['registrationyear'] > 1975) & (df_app3['power']>0) & (df_app3['power']< 69) & (df_app3['model'] != 'other') & (df_app3['brand'] != 'sonstige_autos')
df_app3.loc[fiat500low,['power']] = 0
freelanderlow = (df_app3['brand'] == 'land_rover') & (df_app3['model'] == 'freelander') & (df_app3['power']>0) & (df_app3['power']< 109)
df_app3.loc[freelanderlow,['power']] = 0
pandalow = (df_app3['brand'] == 'fiat') & (df_app3['model'] == 'panda') & (df_app3['power'] > 0) & (df_app3['power']<30)
df_app3.loc[pandalow,['power']] = 0
seilow = (df_app3['brand'] == 'fiat') & (df_app3['model'] == 'seicento') & (df_app3['power'] > 0) & (df_app3['power']<39)
df_app3.loc[seilow,['power']] = 0
stilow = (df_app3['brand'] == 'fiat') & (df_app3['model'] == 'stilo') & (df_app3['power'] > 0) & (df_app3['power']<59)
df_app3.loc[stilow,['power']] = 0
beetle03 = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'beetle') & (df_app3['registrationyear'] >2002) & (df_app3['power'] > 0) & (df_app3['power']<75)
df_app3.loc[beetle03,['power']] = 0
polow = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'polo') & (df_app3['power']>0) & (df_app3['power'] < 37)
df_app3.loc[polow,['power']] = 0
luplow = (df_app3['model'] == 'lupo') & (df_app3['power']>0) & (df_app3['power'] < 45)
df_app3.loc[luplow,['power']] = 0
golflow = (df_app3['model'] == 'golf') & (df_app3['power']>0) & (df_app3['power'] < 50)
df_app3.loc[golflow,['power']] = 0
movlow = (df_app3['model'] == 'move') & (df_app3['power']>0) & (df_app3['power'] < 40)
df_app3.loc[movlow,['power']] = 0
sharanlow = (df_app3['model'] == 'sharan') & (df_app3['power']>0) & (df_app3['power'] < 90)
df_app3.loc[sharanlow,['power']] = 0
twinlow = (df_app3['model'] == 'twingo') & (df_app3['power']>0) & (df_app3['power'] < 40)
df_app3.loc[twinlow,['power']] = 0
del too_high_hp, hp_toohigh, hp_high, vwgolfhigh, polohighe, passathigh, jagxhigh, captivahigh, vwhigh, citroenhigh, chryslerhigh, fiathigh
del suzukihigh, arhigh, fordhigh, chevyhigh, hyundaihigh, mitsubishihigh, nissanhigh, opelhigh, pehigh, seathigh, volvohigh, smarthigh
del too_low, bmw, opellow, vwlow, citroenlow, fordlow, renaultlow, merclow, bmwlow, audilow, fiatlow, pelow, trabantlow, nislow, sk45, hylow, subarulow
del dacialow, k55, lrlow, fiat500low, freelanderlow, pandalow, seilow, stilow, beetle03, polow, luplow, golflow, movlow, sharanlow, twinlow
gc.collect()
0
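Most of the brand-level upper limits above follow one pattern: reset `power` to 0 (treated as missing) when it exceeds a per-brand maximum. A table-driven sketch of that pattern is shown below. The three limits are copied from the rules above, the toy frame is invented, and model-specific rules would still need their own handling.

```python
import pandas as pd

# Per-brand maximum plausible HP (values taken from the rules above)
brand_max_hp = {'fiat': 220, 'citroen': 241, 'suzuki': 290}

toy = pd.DataFrame({
    'brand': ['fiat', 'fiat', 'citroen', 'suzuki', 'bmw'],
    'power': [60.0, 500.0, 300.0, 290.0, 560.0],
})

# Brands without an entry map to NaN; comparing against NaN is False, so they are left alone
limit = toy['brand'].map(brand_max_hp)
too_high = toy['power'] > limit
toy.loc[too_high, 'power'] = 0
```

Keeping the limits in one dictionary makes them easier to review and removes the need to create and delete one mask variable per brand.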
def fill_gearbox(df, threshold=0.9, verbose=True):
    df = df.copy()
    df['gearbox'] = df['gearbox'].str.lower().str.strip()
    fill_strategies = [
        ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'],
        ['brand', 'model', 'fueltype', 'vehicletype'],
        ['brand', 'model', 'fueltype'],
        ['brand', 'model', 'vehicletype'],
        ['brand', 'model'],
        ['brand']
    ]
    total_filled = 0
    start_missing = df['gearbox'].isna().sum()
    for cols in fill_strategies:
        # Count how many "auto" and "manual" in each group
        group_counts = (
            df.dropna(subset=['gearbox'])
            .groupby(cols)['gearbox']
            .value_counts(normalize=True)
            .rename('ratio')
            .reset_index()
        )
        # Keep only those where ratio >= threshold
        group_confident = (
            group_counts[group_counts['ratio'] >= threshold]
            .drop_duplicates(subset=cols)
            .rename(columns={'gearbox': 'fill_value'})
            .drop(columns=['ratio'])
        )
        if group_confident.empty:
            continue
        df = df.merge(group_confident, on=cols, how='left', suffixes=('', '_fill'))
        mask = df['gearbox'].isna() & df['fill_value'].notna()
        filled_now = mask.sum()
        df.loc[mask, 'gearbox'] = df.loc[mask, 'fill_value']
        df.drop(columns='fill_value', inplace=True)
        total_filled += filled_now
        if verbose and filled_now > 0:
            print(f"Filled {filled_now} missing gearbox values using {cols} (≥{threshold*100:.0f}% majority rule)")
        if df['gearbox'].isna().sum() == 0:
            break
    if verbose:
        end_missing = df['gearbox'].isna().sum()
        print(f"\n✅ Gearbox filling complete: {start_missing - end_missing} filled, {end_missing} still missing.")
    return df
df_app3g = fill_gearbox(df_app3)
Filled 7190 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥90% majority rule)
Filled 831 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype'] (≥90% majority rule)
Filled 635 missing gearbox values using ['brand', 'model', 'fueltype'] (≥90% majority rule)
Filled 783 missing gearbox values using ['brand', 'model', 'vehicletype'] (≥90% majority rule)
Filled 1193 missing gearbox values using ['brand', 'model'] (≥90% majority rule)
Filled 1670 missing gearbox values using ['brand'] (≥90% majority rule)
✅ Gearbox filling complete: 12302 filled, 6425 still missing.
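The majority rule used by `fill_gearbox` can be illustrated on a toy frame with a single grouping level. This is a simplified sketch, not the notebook's function: it uses only `brand` as the grouping key, and the data are made up. A missing gearbox is filled only when one value accounts for at least `threshold` of the known rows in the group.

```python
import numpy as np
import pandas as pd

# Toy data: 'vw' is unanimously manual, 'bmw' is split 50/50
toy = pd.DataFrame({
    'brand':   ['vw'] * 11 + ['bmw'] * 3,
    'gearbox': ['manual'] * 10 + [np.nan] + ['manual', 'auto', np.nan],
})

threshold = 0.9
known = toy.dropna(subset=['gearbox'])

# Share of each gearbox value within a brand
share = known.groupby('brand')['gearbox'].value_counts(normalize=True)

# Keep only (brand, gearbox) pairs whose share clears the threshold
confident = share[share >= threshold].reset_index(name='ratio')
fill_map = confident.set_index('brand')['gearbox']

# Fill missing values only where a confident majority exists
filled = toy['gearbox'].fillna(toy['brand'].map(fill_map))
print(filled.isna().sum())  # 1
```

The missing `vw` row becomes `manual`, while the missing `bmw` row stays empty because neither value reaches the 90% share. This matches the output above, where some values remain missing after every grouping level has been tried.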
cvt = (df_app3g['model'].isin(['corsa'])) & (df_app3g['vehicletype'].isin(['wagon','coupe','bus']))
df_app3g.loc[cvt,'vehicletype'] = np.nan
gbus = (df_app3g['model'].isin(['golf'])) & (df_app3g['vehicletype'] == 'bus')
df_app3g.loc[gbus,['vehicletype']] = np.nan
puv = (df_app3g['model'].isin(['polo'])) & (df_app3g['vehicletype'].isin(['bus', 'suv']))
df_app3g.loc[puv,['vehicletype']] = np.nan
bmwsuv = (df_app3g['model'].isin(['3er'])) & (df_app3g['vehicletype'].isin(['bus', 'suv']))
df_app3g.loc[bmwsuv,['vehicletype']] = np.nan
astrabus = (df_app3g['model'].isin(['astra'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[astrabus,['vehicletype']] = np.nan
nosuv = (df_app3g['vehicletype'] == 'suv') & (df_app3g['model'].isin(['beetle','combo','transporter','vectra', 'verso','500','vito','a3','vivaro','a4','transit','m_reihe','astra','b_klasse','slk','corolla','corsa','doblo','r19','fabia','focus','picanto', 'omega', '147']))
df_app3g.loc[nosuv,['vehicletype']] = np.nan
noconvertible = (df_app3g['vehicletype'] == 'convertible') & (df_app3g['model'].isin(['ypsilon','100','passat','200','7er','90','a_klasse','antara','c2','calibra','forester','galaxy','glk','i3','kuga','nubira','zafira']))
df_app3g.loc[noconvertible,['vehicletype']] = np.nan
nocoupe = (df_app3g['vehicletype'] == 'coupe') & (df_app3g['model'].isin(['micra','aygo','9000','v70','a1','arosa','toledo','bora','ptcruiser','cx_reihe','seicento','getz','meriva','zafira']))
df_app3g.loc[nocoupe,['vehicletype']] = np.nan
nobus = (df_app3g['vehicletype'] == 'bus') & (df_app3g['model'].isin(['c5','civic','mondeo','astra','tucson','antara','a4','5er','4_reihe','x_trail','a6','sl','tigra','swift','micra','santa','forester','galant','justy','punto','panda','pajero','outlander','omega','m_klasse','mx_reihe','materia','lancer']))
df_app3g.loc[nobus,['vehicletype']] = np.nan
nowagon = (df_app3g['vehicletype'] == 'wagon') & (df_app3g['model'].isin(['jazz','calibra','200','getz','twingo','yeti','g_klasse','fox','arosa','clk','i3','musa','touareg','lanos','micra','a2','90','q3','lupo','santa','kappa','kalos','sl','niva','spark','slk']))
df_app3g.loc[nowagon,['vehicletype']] = np.nan
nosedan = (df_app3g['vehicletype'] == 'sedan') & (df_app3g['model'].isin(['v50','galaxy','z_reihe','s_max','materia','forester','tucson','move','cayenne','spider','sorento','cx_reihe','antara','rav','combo','cr_reihe']))
df_app3g.loc[nosedan,['vehicletype']] = np.nan
nosmall = (df_app3g['vehicletype'] == 'small') & (df_app3g['model'].isin(['doblo','verso','vivaro','6_reihe','defender','kuga','croma','m_reihe','grand','cayenne','rangerover','a6','sportage','accord','octavia','impreza','s_type','s_klasse','rx_reihe']))
df_app3g.loc[nosmall,['vehicletype']] = np.nan
noaudi = (df_app3g['model'] == 'audi')
df_app3g.loc[noaudi,['model']] = np.nan
notrab = (df_app3g['brand'] == 'trabant') & (df_app3g['model'] == '601') & (df_app3g['vehicletype'].isin(['coupe','suv']))
df_app3g.loc[notrab,['vehicletype']] = np.nan
nokiacoupe = (df_app3g['brand'].isin(['kia'])) & (df_app3g['vehicletype'].isin(['coupe']))
df_app3g.loc[nokiacoupe,['vehicletype']] = np.nan
daewc = (df_app3g['brand'].isin(['daewoo'])) & (df_app3g['model'] == 'lanos') & (df_app3g['vehicletype'].isin(['coupe','wagon']))
df_app3g.loc[daewc,['vehicletype']] = np.nan
lanc = (df_app3g['brand'] == 'lancia') & (df_app3g['model'].isin(['kappa','delta'])) & (df_app3g['vehicletype'] == 'coupe')
df_app3g.loc[lanc,['vehicletype']] = np.nan
alfa147 = (df_app3g['brand'] == 'alfa_romeo') & (df_app3g['model'] == '147') & ~(df_app3g['vehicletype'].isin(['small','other']))
df_app3g.loc[alfa147,['vehicletype']] = np.nan
rovernos = (df_app3g['brand'] == 'rover') & (df_app3g['model'].isin(['rangerover'])) & ~(df_app3g['vehicletype'].isin(['suv','other']))
df_app3g.loc[rovernos,['vehicletype']] = np.nan
ibizano = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['ibiza'])) & ~(df_app3g['vehicletype'].isin(['other','small','sedan']))
df_app3g.loc[ibizano,['vehicletype']] = np.nan
alteano = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['altea'])) & ~(df_app3g['vehicletype'].isin(['other','small']))
df_app3g.loc[alteano,['vehicletype']] = np.nan
focuscb = (df_app3g['brand'] == 'ford') & (df_app3g['model'] == 'focus') & (df_app3g['vehicletype'].isin(['coupe','bus']))
df_app3g.loc[focuscb,['vehicletype']] = np.nan
ccw = (df_app3g['brand'] == 'chrysler') & (df_app3g['model'] == 'crossfire') & (df_app3g['vehicletype'] == 'wagon')
df_app3g.loc[ccw,['vehicletype']] = np.nan
slcs = (df_app3g['brand'] == 'seat') & (df_app3g['model'] == 'leon') & (df_app3g['vehicletype'].isin(['coupe','sedan']))
df_app3g.loc[slcs,['vehicletype']] = np.nan
mcb = (df_app3g['brand'] == 'mazda') & (df_app3g['model'] == '3_reihe') & (df_app3g['vehicletype'].isin(['coupe','bus']))
df_app3g.loc[mcb,['vehicletype']] = np.nan
calc = (df_app3g['brand'] == 'opel') & (df_app3g['model'] == 'calibra') & (df_app3g['vehicletype'] != 'coupe')
df_app3g.loc[calc,['vehicletype']] = np.nan
hicsb = (df_app3g['brand'] == 'hyundai') & (df_app3g['model'] == 'i_reihe') & (df_app3g['vehicletype'].isin(['coupe','suv','bus']))
df_app3g.loc[hicsb,['vehicletype']] = np.nan
f500 = (df_app3g['brand'] == 'fiat') & (df_app3g['model'] == '500') & ~(df_app3g['vehicletype'].isin(['small','convertible']))
df_app3g.loc[f500,['vehicletype']] = np.nan
fpun = (df_app3g['brand'] == 'fiat') & (df_app3g['model'] == 'punto') & (df_app3g['vehicletype'].isin(['coupe','sedan']))
df_app3g.loc[fpun,['vehicletype']] = np.nan
daian = (df_app3g['brand'] == 'daihatsu') & (df_app3g['model'] == 'terios') & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[daian,['vehicletype']] = np.nan
ladan = (df_app3g['brand'] == 'lada') & (df_app3g['model'] == 'niva') & (df_app3g['vehicletype'].isin(['bus','sedan']))
df_app3g.loc[ladan,['vehicletype']] = np.nan
aq5 = (df_app3g['brand'] == 'audi') & (df_app3g['model'] == 'q5') & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[aq5,['vehicletype']] = np.nan
aq7 = (df_app3g['brand'] == 'audi') & (df_app3g['model'] == 'q7') & (df_app3g['vehicletype'].isin(['sedan','wagon']))
df_app3g.loc[aq7,['vehicletype']] = np.nan
dd = (df_app3g['brand'] == 'dacia') & (df_app3g['model'] == 'duster') & (df_app3g['vehicletype'].isin(['bus','wagon']))
df_app3g.loc[dd,['vehicletype']] = np.nan
tr = (df_app3g['brand'] == 'toyota') & (df_app3g['model'] == 'rav') & (df_app3g['vehicletype'].isin(['small','convertible']))
df_app3g.loc[tr,['vehicletype']] = np.nan
vxc = (df_app3g['brand'] == 'volvo') & (df_app3g['model'] == 'xc_reihe') & (df_app3g['vehicletype'].isin(['wagon','sedan']))
df_app3g.loc[vxc,['vehicletype']] = np.nan
sandan = (df_app3g['brand'] == 'dacia') & (df_app3g['model'] == 'sandero') & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[sandan,['vehicletype']] = np.nan
sju = (df_app3g['brand'] == 'subaru') & (df_app3g['model'] == 'justy') & (df_app3g['vehicletype'].isin(['suv','sedan','wagon']))
df_app3g.loc[sju,['vehicletype']] = np.nan
lym = (df_app3g['brand'] == 'lancia') & (df_app3g['model'].isin(['ypsilon','musa'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[lym,['vehicletype']] = np.nan
dmat = (df_app3g['brand'] == 'daewoo') & (df_app3g['model'].isin(['matiz'])) & (df_app3g['vehicletype'].isin(['sedan','wagon']))
df_app3g.loc[dmat,['vehicletype']] = np.nan
tay = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['yaris'])) & (df_app3g['vehicletype'].isin(['bus','wagon']))
df_app3g.loc[tay,['vehicletype']] = np.nan
tayr = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['aygo','auris'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[tayr,['vehicletype']] = np.nan
tus = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['corolla'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[tus,['vehicletype']] = np.nan
coops = (df_app3g['brand'] == 'mini') & (df_app3g['model'].isin(['cooper'])) & (df_app3g['vehicletype'].isin(['suv','wagon','bus']))
df_app3g.loc[coops,['vehicletype']] = np.nan
coops = (df_app3g['brand'] == 'mini') & (df_app3g['vehicletype'] == 'bus')
df_app3g.loc[coops,['vehicletype']] = np.nan
mone = (df_app3g['brand'] == 'mini') & (df_app3g['model'].isin(['one'])) & (df_app3g['vehicletype'].isin(['suv','sedan']))
df_app3g.loc[mone,['vehicletype']] = np.nan
clubmn = (df_app3g['brand'] == 'mini') & (df_app3g['model'].isin(['clubman'])) & (df_app3g['vehicletype'].isin(['coupe','sedan']))
df_app3g.loc[clubmn,['vehicletype']] = np.nan
suzsw = (df_app3g['brand'] == 'suzuki') & (df_app3g['model'].isin(['swift'])) & (df_app3g['vehicletype'].isin(['wagon','sedan']))
df_app3g.loc[suzsw,['vehicletype']] = np.nan
cit12 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c1','c2'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[cit12,['vehicletype']] = np.nan
cit4 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c4'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[cit4,['vehicletype']] = np.nan
cit3 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c3'])) & (df_app3g['vehicletype'].isin(['sedan','wagon','bus']))
df_app3g.loc[cit3,['vehicletype']] = np.nan
kr = (df_app3g['brand'] == 'kia') & (df_app3g['model'].isin(['rio'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[kr,['vehicletype']] = np.nan
cs = (df_app3g['brand'] == 'chevrolet') & (df_app3g['model'].isin(['spark'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[cs,['vehicletype']] = np.nan
p2 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['2_reihe'])) & (df_app3g['vehicletype'].isin(['suv','bus']))
df_app3g.loc[p2,['vehicletype']] = np.nan
p1 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['1_reihe'])) & (df_app3g['vehicletype'].isin(['sedan','wagon','convertible']))
df_app3g.loc[p1,['vehicletype']] = np.nan
p3 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['3_reihe'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[p3,['vehicletype']] = np.nan
hg = (df_app3g['brand'] == 'hyundai') & (df_app3g['model'].isin(['getz'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[hg,['vehicletype']] = np.nan
oc = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['corsa'])) & (df_app3g['vehicletype'].isin(['coupe','convertible']))
df_app3g.loc[oc,['vehicletype']] = np.nan
oa = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['agila'])) & (df_app3g['vehicletype'].isin(['bus','wagon','sedan']))
df_app3g.loc[oa,['vehicletype']] = np.nan
omer = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['meriva'])) & (df_app3g['vehicletype'].isin(['bus','suv','sedan']))
df_app3g.loc[omer,['vehicletype']] = np.nan
ok = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['kadett'])) & (df_app3g['vehicletype'].isin(['suv']))
df_app3g.loc[ok,['vehicletype']] = np.nan
oz = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['zafira'])) & (df_app3g['vehicletype'].isin(['suv','sedan']))
df_app3g.loc[oz,['vehicletype']] = np.nan
hj = (df_app3g['brand'] == 'honda') & (df_app3g['model'].isin(['jazz'])) & (df_app3g['vehicletype'].isin(['bus','coupe','sedan']))
df_app3g.loc[hj,['vehicletype']] = np.nan
mak = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['a_klasse'])) & (df_app3g['vehicletype'].isin(['bus','suv','wagon','coupe']))
df_app3g.loc[mak,['vehicletype']] = np.nan
mbk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['b_klasse'])) & (df_app3g['vehicletype'].isin(['sedan','wagon']))
df_app3g.loc[mbk,['vehicletype']] = np.nan
mck = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['clk'])) & (df_app3g['vehicletype'].isin(['sedan','small','suv']))
df_app3g.loc[mck,['vehicletype']] = np.nan
msk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['sprinter'])) & (df_app3g['vehicletype'].isin(['sedan','small']))
df_app3g.loc[msk,['vehicletype']] = np.nan
mvk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['viano'])) & (df_app3g['vehicletype'].isin(['sedan','small']))
df_app3g.loc[mvk,['vehicletype']] = np.nan
mvtk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['vito'])) & (df_app3g['vehicletype'].isin(['small']))
df_app3g.loc[mvtk,['vehicletype']] = np.nan
nn = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['note'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[nn,['vehicletype']] = np.nan
ff = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['fiesta'])) & (df_app3g['vehicletype'].isin(['bus','convertible']))
df_app3g.loc[ff,['vehicletype']] = np.nan
fk = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['ka'])) & (df_app3g['vehicletype'].isin(['coupe','wagon','convertible']))
df_app3g.loc[fk,['vehicletype']] = np.nan
ffu = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['fusion'])) & (df_app3g['vehicletype'].isin(['wagon','bus']))
df_app3g.loc[ffu,['vehicletype']] = np.nan
ffo = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['focus'])) & (df_app3g['vehicletype'].isin(['bus','suv','coupe','convertible']))
df_app3g.loc[ffo,['vehicletype']] = np.nan
fe = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['escort'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[fe,['vehicletype']] = np.nan
fm = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['mondeo'])) & (df_app3g['vehicletype'].isin(['small','coupe']))
df_app3g.loc[fm,['vehicletype']] = np.nan
sf2 = (df_app3g['brand'] == 'smart') & (df_app3g['model'].isin(['fortwo'])) & (df_app3g['vehicletype'].isin(['bus','sedan']))
df_app3g.loc[sf2,['vehicletype']] = np.nan
sf4 = (df_app3g['brand'] == 'smart') & (df_app3g['model'].isin(['forfour'])) & (df_app3g['vehicletype'].isin(['sedan','wagon','coupe','convertible']))
df_app3g.loc[sf4,['vehicletype']] = np.nan
sf4 = (df_app3g['brand'] == 'smart') & (df_app3g['vehicletype'] == 'bus')
df_app3g.loc[sf4,['vehicletype']] = np.nan
fsed = (df_app3g['brand'] == 'fiat') & (df_app3g['model'].isin(['panda','seicento'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[fsed,['vehicletype']] = np.nan
sleo = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['leon'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[sleo,['vehicletype']] = np.nan
sm = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['mii'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[sm,['vehicletype']] = np.nan
rc = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['clio'])) & (df_app3g['vehicletype'].isin(['coupe','bus']))
df_app3g.loc[rc,['vehicletype']] = np.nan
rt = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['twingo'])) & (df_app3g['vehicletype'].isin(['sedan','coupe','convertible']))
df_app3g.loc[rt,['vehicletype']] = np.nan
rm = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['modus'])) & (df_app3g['vehicletype'].isin(['sedan','bus','wagon']))
df_app3g.loc[rm,['vehicletype']] = np.nan
rme = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['megane'])) & (df_app3g['vehicletype'].isin(['suv','bus','small']))
df_app3g.loc[rme,['vehicletype']] = np.nan
rk = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['kangoo'])) & (df_app3g['vehicletype'].isin(['suv','sedan','small']))
df_app3g.loc[rk,['vehicletype']] = np.nan
skf = (df_app3g['brand'] == 'skoda') & (df_app3g['model'].isin(['fabia'])) & (df_app3g['vehicletype'].isin(['bus','convertible']))
df_app3g.loc[skf,['vehicletype']] = np.nan
skc = (df_app3g['brand'] == 'skoda') & (df_app3g['model'].isin(['citigo'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[skc,['vehicletype']] = np.nan
vwp = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['polo'])) & (df_app3g['vehicletype'].isin(['wagon','coupe','convertible']))
df_app3g.loc[vwp,['vehicletype']] = np.nan
vwu = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['up'])) & (df_app3g['vehicletype'].isin(['sedan','suv']))
df_app3g.loc[vwu,['vehicletype']] = np.nan
vwg = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['golf','passat'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[vwg,['vehicletype']] = np.nan
vwb = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['beetle'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[vwb,['vehicletype']] = np.nan
vwc = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['caddy'])) & (df_app3g['vehicletype'].isin(['small','suv','sedan','convertible']))
df_app3g.loc[vwc,['vehicletype']] = np.nan
vwf = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['fox'])) & (df_app3g['vehicletype'].isin(['coupe','convertible']))
df_app3g.loc[vwf,['vehicletype']] = np.nan
vwl = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['lupo'])) & (df_app3g['vehicletype'].isin(['coupe','convertible', 'bus','sedan']))
df_app3g.loc[vwl,['vehicletype']] = np.nan
vws = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['scirocco'])) & (df_app3g['vehicletype'].isin(['small','convertible','sedan']))
df_app3g.loc[vws,['vehicletype']] = np.nan
vwt = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['touran'])) & (df_app3g['vehicletype'].isin(['small','convertible','sedan','suv','wagon']))
df_app3g.loc[vwt,['vehicletype']] = np.nan
vwj = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['jetta'])) & (df_app3g['vehicletype'].isin(['coupe']))
df_app3g.loc[vwj,['vehicletype']] = np.nan
vwsh = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['sharan'])) & (df_app3g['vehicletype'].isin(['small','wagon','sedan','suv']))
df_app3g.loc[vwsh,['vehicletype']] = np.nan
vwtrans = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['transporter'])) & (df_app3g['vehicletype'].isin(['small','sedan','wagon']))
df_app3g.loc[vwtrans,['vehicletype']] = np.nan
bmwx = (df_app3g['brand'] == 'bmw') & (df_app3g['model'].isin(['x_reihe'])) & (df_app3g['vehicletype'].isin(['wagon','sedan','bus']))
df_app3g.loc[bmwx,['vehicletype']] = np.nan
b5 = (df_app3g['brand'] == 'bmw') & (df_app3g['model'].isin(['5er'])) & (df_app3g['vehicletype'].isin(['small','suv']))
df_app3g.loc[b5,['vehicletype']] = np.nan
b1 = (df_app3g['brand'] == 'bmw') & (df_app3g['model'].isin(['1er'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[b1,['vehicletype']] = np.nan
maz3 = (df_app3g['brand'] == 'mazda') & (df_app3g['model'].isin(['3_reihe'])) & (df_app3g['vehicletype'].isin(['wagon','coupe','convertible']))
df_app3g.loc[maz3,['vehicletype']] = np.nan
maz6 = (df_app3g['brand'] == 'mazda') & (df_app3g['model'].isin(['6_reihe'])) & (df_app3g['vehicletype'].isin(['coupe','convertible','bus','small']))
df_app3g.loc[maz6,['vehicletype']] = np.nan
mbck = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['c_klasse'])) & (df_app3g['vehicletype'].isin(['bus','small','other']))
df_app3g.loc[mbck,['vehicletype']] = np.nan
mbck = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'] == 'c_klasse') & (df_app3g['registrationyear'] == 2001) & (df_app3g['power'] == 122) & (df_app3g['fueltype'] == 'gasoline') & (df_app3g['mileage'] == 150000) & (df_app3g['price'] > 1799) & (df_app3g['price'] < 3501)
df_app3g.loc[mbck,['vehicletype']] = 'sedan'
mbek = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['e_klasse'])) & (df_app3g['vehicletype'].isin(['bus','small','suv']))
df_app3g.loc[mbek,['vehicletype']] = np.nan
mbsk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['s_klasse'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[mbsk,['vehicletype']] = np.nan
mbcs = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['cl','sl'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[mbcs,['vehicletype']] = np.nan
mbglk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['glk'])) & (df_app3g['vehicletype'].isin(['sedan','coupe']))
df_app3g.loc[mbglk,['vehicletype']] = np.nan
vwbor = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['bora'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[vwbor,['vehicletype']] = np.nan
aa4 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a4'])) & (df_app3g['vehicletype'].isin(['coupe']))
df_app3g.loc[aa4,['vehicletype']] = np.nan
aa6 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a6'])) & (df_app3g['vehicletype'].isin(['suv']))
df_app3g.loc[aa6,['vehicletype']] = np.nan
aa8 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a8'])) & (df_app3g['vehicletype'].isin(['small','wagon']))
df_app3g.loc[aa8,['vehicletype']] = np.nan
aa5 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a5'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[aa5,['vehicletype']] = np.nan
aa1 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a1','q3'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[aa1,['vehicletype']] = np.nan
fc = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['c_max'])) & (df_app3g['vehicletype'].isin(['sedan','bus','suv']))
df_app3g.loc[fc,['vehicletype']] = np.nan
fm = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['mustang'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[fm,['vehicletype']] = np.nan
ov = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['vectra'])) & (df_app3g['vehicletype'].isin(['small','bus','convertible']))
df_app3g.loc[ov,['vehicletype']] = np.nan
os = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['signum'])) & (df_app3g['vehicletype'].isin(['sedan','bus']))
df_app3g.loc[os,['vehicletype']] = np.nan
omega = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['omega'])) & (df_app3g['vehicletype'].isin(['small']))
df_app3g.loc[omega,['vehicletype']] = np.nan
p5 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['5_reihe'])) & (df_app3g['vehicletype'].isin(['coupe','small','convertible']))
df_app3g.loc[p5,['vehicletype']] = np.nan
p4 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['4_reihe'])) & (df_app3g['vehicletype'].isin(['suv','small']))
df_app3g.loc[p4,['vehicletype']] = np.nan
rlag = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['laguna'])) & (df_app3g['vehicletype'].isin(['coupe','small','convertible']))
df_app3g.loc[rlag,['vehicletype']] = np.nan
rsc = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['scenic'])) & (df_app3g['vehicletype'].isin(['suv','sedan','bus']))
df_app3g.loc[rsc,['vehicletype']] = np.nan
ml = (df_app3g['brand'] == 'mitsubishi') & (df_app3g['model'].isin(['lancer'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[ml,['vehicletype']] = np.nan
mco = (df_app3g['brand'] == 'mitsubishi') & (df_app3g['model'].isin(['colt'])) & (df_app3g['vehicletype'].isin(['suv','wagon','bus','sedan']))
df_app3g.loc[mco,['vehicletype']] = np.nan
mout = (df_app3g['brand'] == 'mitsubishi') & (df_app3g['model'].isin(['outlander'])) & (df_app3g['vehicletype'].isin(['wagon','sedan']))
df_app3g.loc[mout,['vehicletype']] = np.nan
cc5 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c5'])) & (df_app3g['vehicletype'].isin(['small','bus']))
df_app3g.loc[cc5,['vehicletype']] = np.nan
st = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['toledo'])) & (df_app3g['vehicletype'].isin(['small','bus']))
df_app3g.loc[st,['vehicletype']] = np.nan
tv = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['verso'])) & (df_app3g['vehicletype'].isin(['sedan','bus']))
df_app3g.loc[tv,['vehicletype']] = np.nan
ta = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['avensis'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[ta,['vehicletype']] = np.nan
vv40 = (df_app3g['brand'] == 'volvo') & (df_app3g['model'].isin(['v40'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[vv40,['vehicletype']] = np.nan
vcr = (df_app3g['brand'] == 'volvo') & (df_app3g['model'].isin(['c_reihe'])) & (df_app3g['vehicletype'].isin(['sedan','wagon']))
df_app3g.loc[vcr,['vehicletype']] = np.nan
fbrav = (df_app3g['brand'] == 'fiat') & (df_app3g['model'].isin(['bravo'])) & (df_app3g['vehicletype'].isin(['sedan','wagon','coupe']))
df_app3g.loc[fbrav,['vehicletype']] = np.nan
c300 = (df_app3g['brand'] == 'chrysler') & (df_app3g['model'].isin(['300c'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[c300,['vehicletype']] = np.nan
dand = (df_app3g['brand'] == 'dacia') & (df_app3g['model'].isin(['logan'])) & (df_app3g['vehicletype'].isin(['suv']))
df_app3g.loc[dand,['vehicletype']] = np.nan
land = (df_app3g['brand'] == 'lancia') & (df_app3g['model'].isin(['delta']))
df_app3g.loc[land,['vehicletype']] = 'other'
rdef = (df_app3g['brand'] == 'rover') & (df_app3g['model'].isin(['defender']))
df_app3g.loc[rdef,['vehicletype']] = 'suv'
jb = (df_app3g['brand'] == 'jeep') & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[jb,['vehicletype']] = np.nan
rdisc = (df_app3g['brand'] == 'rover') & (df_app3g['model'].isin(['discovery'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[rdisc,['vehicletype']] = np.nan
norover = (df_app3g['brand'] == 'rover') & (df_app3g['model'].isin(['defender','freelander','discovery','rangerover']))
df_app3g.loc[norover,['brand']] = 'land_rover'
lrfree = (df_app3g['brand'] == 'land_rover') & (df_app3g['model'].isin(['freelander'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[lrfree,['vehicletype']] = np.nan
nq = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['qashqai'])) & (df_app3g['vehicletype'].isin(['sedan','bus','wagon']))
df_app3g.loc[nq,['vehicletype']] = np.nan
nq = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['qashqai'])) & (df_app3g['vehicletype'].isna())
df_app3g.loc[nq,['vehicletype']] = 'suv'
nnav = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['navara'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[nnav,['vehicletype']] = np.nan
nnav = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['navara'])) & (df_app3g['vehicletype'].isna())
df_app3g.loc[nnav,['vehicletype']] = 'suv'
hcr = (df_app3g['brand'] == 'honda') & (df_app3g['model'].isin(['cr_reihe'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[hcr,['vehicletype']] = np.nan
mcon = (df_app3g['brand'] == 'mazda') & (df_app3g['model'].isin(['5_reihe','cx_reihe','1_reihe'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[mcon,['vehicletype']] = np.nan
maz5 = (df_app3g['brand'] == 'mazda') & (df_app3g['model'].isin(['5_reihe'])) & (df_app3g['vehicletype'].isin(['suv','wagon','sedan']))
df_app3g.loc[maz5,['vehicletype']] = np.nan
cit3 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c3'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[cit3,['vehicletype']] = np.nan
fvert = (df_app3g['brand'] == 'fiat') & (df_app3g['model'].isin(['punto','panda'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[fvert,['vehicletype']] = np.nan
zuvert = (df_app3g['brand'] == 'suzuki') & (df_app3g['model'].isin(['swift','grand'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[zuvert,['vehicletype']] = np.nan
mgk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['g_klasse'])) & (df_app3g['vehicletype'].isin(['convertible','sedan']))
df_app3g.loc[mgk,['vehicletype']] = np.nan
arsp = (df_app3g['brand'] == 'alfa_romeo') & (df_app3g['model'].isin(['spider'])) & (df_app3g['vehicletype'].isin(['coupe']))
df_app3g.loc[arsp,['vehicletype']] = np.nan
toua = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'] == 'tiguan') & (df_app3g['vehicletype'].isin(['sedan','bus']))
df_app3g.loc[toua,['vehicletype']] = np.nan
toua = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'] == 'tiguan') & (df_app3g['vehicletype'].isna())
df_app3g.loc[toua,['vehicletype']] = 'suv'
ptc = (df_app3g['brand'] == 'chrysler') & (df_app3g['model'] == 'ptcruiser') & (df_app3g['vehicletype'] == 'bus')
df_app3g.loc[ptc,['vehicletype']] = np.nan
del toua, jb, rdisc, norover, lrfree, nq, nnav, hcr, mcon, maz5, fvert, zuvert, mgk, arsp, bmwx, b5, b1, maz3, maz6, mbck, mbek, mbsk, mbcs, mbglk, vwbor, aa4, aa6
del aa8, aa5, aa1, fc, fm, ov, os, omega, p5, p4, rlag, rsc, ml, mco, mout, cc5, st, tv, ta, vv40, vcr, fbrav, c300, dand, land, rdef, fsed, sleo, sm, rc, rt, rm, rme
del rk, skf, skc, vwp, vwu, vwg, vwb, vwc, vwf, vwl, vws, vwt, vwj, vwsh, vwtrans, kr, cs, p2, p1, p3, hg, oc, oa, omer, ok, oz, hj, mak, mbk, mck, msk, mvk, mvtk, nn
del ff, fk, ffu, ffo, fe, sf2, sf4, cit4, cit3, cit12, suzsw, clubmn, mone, coops, tus, tayr, tay, dmat, lym, sju, sandan, vxc, tr, dd, aq7, aq5, ladan, daian, fpun, f500
del hicsb, calc, mcb, slcs, ccw, focuscb, alteano, ibizano, rovernos, alfa147, lanc, daewc, nokiacoupe, notrab,ptc
gc.collect()
0
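The body-type cleanup above repeats one operation many times: for a given brand and model, blank out `vehicletype` values that are implausible for that car. A minimal sketch of a table-driven variant on toy data. The `invalid_types` dict and the helper `blank_invalid_types` are hypothetical names, not part of the notebook; the three rules are copied from the Ford Fiesta, Nissan Note and Opel Agila masks above.

```python
import numpy as np
import pandas as pd

# Hypothetical rule table: (brand, model) -> body types treated as implausible
invalid_types = {
    ('ford', 'fiesta'): ['bus', 'convertible'],
    ('nissan', 'note'): ['sedan'],
    ('opel', 'agila'): ['bus', 'wagon', 'sedan'],
}

def blank_invalid_types(df, rules):
    """Set vehicletype to NaN wherever a (brand, model) row carries a listed body type."""
    out = df.copy()
    for (brand, model), bad in rules.items():
        mask = (
            (out['brand'] == brand)
            & (out['model'] == model)
            & out['vehicletype'].isin(bad)
        )
        out.loc[mask, 'vehicletype'] = np.nan
    return out

# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    'brand': ['ford', 'ford', 'nissan', 'opel'],
    'model': ['fiesta', 'fiesta', 'note', 'agila'],
    'vehicletype': ['bus', 'small', 'sedan', 'small'],
})

cleaned = blank_invalid_types(toy, invalid_types)
print(cleaned['vehicletype'].isna().tolist())  # [True, False, True, False]
```

Keeping the rules in one structure would also remove the need for the long `del` lists, since no per-rule mask variables are left behind.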
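# postalcode is stored as int64, so a leading zero (e.g. 01067 -> 1067) is lost;
# pad back to five digits (assumes five-digit codes) before taking the first digit as the region bin
df_app3g['pc_bin'] = df_app3g['postalcode'].astype(str).str.zfill(5).str[0]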
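The `pc_bin` feature bins listings by the first character of the postal code. A minimal sketch of how that extraction behaves on integer codes, assuming the column holds five-digit postal codes whose leading zero was dropped when it was read as `int64`. That is an assumption about the data, not something the notebook states, and the sample values are illustrative.

```python
import pandas as pd

# 1067 stands for the five-digit code 01067 after integer parsing (assumed)
codes = pd.Series([1067, 10115, 99998])

# Taking the first character directly puts 1067 into bin '1'
naive = codes.astype(str).str[0]

# Padding to five digits first keeps it in bin '0'
padded = codes.astype(str).str.zfill(5).str[0]

print(naive.tolist())   # ['1', '1', '9']
print(padded.tolist())  # ['0', '1', '9']
```

If the assumption holds, padding with `str.zfill(5)` before slicing keeps four-digit integers from being merged into the wrong regional bin.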
def fill_all_missing_values(df, repeat_until_change=True, threshold=0.7, verbose=True, max_iterations=10):
    """
    Fill missing values for power, vehicletype, model, fueltype using tiered group strategies.
    Optimized version with better memory management and early stopping.
    """
    df = df.copy()

    def safe_mode(series):
        """Return mode if confident enough (>= threshold), else NaN."""
        s = series.dropna()
        if len(s) == 0:
            return np.nan
        counts = s.value_counts(normalize=True)
        if len(counts) == 0:
            return np.nan
        top_val, top_freq = counts.index[0], counts.iloc[0]
        return top_val if top_freq >= threshold else np.nan

    def is_zero_condition(condition):
        """Heuristic to detect if condition checks for zero (lambda x: x == 0)."""
        try:
            test = condition(pd.Series([0, np.nan], dtype=object))
            if isinstance(test, (bool, np.bool_)) and test:
                return True
            if hasattr(test, "__len__") and len(test) >= 1:
                return bool(test.iloc[0] if hasattr(test, 'iloc') else test[0])
        except Exception:
            pass
        return False

    def make_key_tuple(row_vals):
        """Helper: convert list-like row values to a hashable tuple with None for NaN."""
        return tuple(None if (isinstance(x, float) and np.isnan(x)) else x for x in row_vals)

    def fill_column(target_col, fill_strategies, condition=lambda x: x.isna()):
        total_filled = 0
        zero_check = is_zero_condition(condition)

        # Track initial state
        if zero_check:
            initial_missing = (df[target_col] == 0).sum()
        else:
            initial_missing = df[target_col].isna().sum()
        if initial_missing == 0:
            return 0
        if verbose:
            print(f" → Starting with {initial_missing:,} missing values in '{target_col}'")

        for cols in fill_strategies:
            # Check if there's still work to do
            if zero_check:
                current_missing = (df[target_col] == 0).sum()
            else:
                current_missing = df[target_col].isna().sum()
            if current_missing == 0:
                break

            start_time = time.time()
            try:
                # Compute group modes using safe_mode
                group_modes = (
                    df.groupby(cols, dropna=False)[target_col]
                    .apply(safe_mode)
                    .reset_index()
                    .rename(columns={target_col: 'fill_value'})
                )
                # Remove groups with no valid fill value
                group_modes = group_modes[group_modes['fill_value'].notna()]
                if len(group_modes) == 0:
                    continue
            except Exception as e:
                if verbose:
                    print(f"⚠️ Skipping strategy {cols} — groupby failed: {str(e)[:50]}")
                continue

            # Build mapping dict from group_modes
            keys = group_modes[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            mapping = dict(zip(keys, group_modes['fill_value'].values))

            # Compute fill_value per-row by mapping (keeps original row order)
            row_keys = df[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            fill_series = row_keys.map(mapping)

            # Create mask of rows that need filling AND have a candidate fill_value
            mask_need = condition(df[target_col])
            mask_candidate = fill_series.notna()
            mask = mask_need & mask_candidate

            # Count before
            if zero_check:
                before_missing = (df[target_col] == 0).sum()
            else:
                before_missing = df[target_col].isna().sum()

            # Perform fill
            if mask.any():
                df.loc[mask, target_col] = fill_series.loc[mask].values

            # Count after
            if zero_check:
                after_missing = (df[target_col] == 0).sum()
            else:
                after_missing = df[target_col].isna().sum()

            filled_now = before_missing - after_missing
            total_filled += int(filled_now)
            if verbose and filled_now > 0:
                elapsed = time.time() - start_time
                print(f"✅ Filled {int(filled_now):,} values in '{target_col}' using {cols} ({after_missing:,} remaining, took {elapsed:.2f}s)")

        return total_filled

    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        total_filled = 0
        if verbose:
            print(f"\n🌀 Iteration {iteration} starting...")

        # --- POWER ---
        power_strategies = [
            ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'pc_bin'],
            ['brand', 'model', 'year_bin'],
            ['brand', 'model', 'registrationyear'],
            ['brand', 'model'],
            ['brand', 'vehicletype'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('power', power_strategies, condition=lambda x: x == 0)

        # --- VEHICLE TYPE ---
        vehicletype_strategies = [
            ['brand', 'model', 'power', 'year_bin'],
            ['brand', 'model', 'power', 'registrationyear'],
            ['brand', 'model', 'power', 'gearbox'],
            ['brand', 'model', 'year_bin'],
            ['brand', 'model', 'registrationyear'],
            ['brand', 'model', 'power'],
            ['brand', 'model', 'gearbox'],
            ['brand', 'model'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'power'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('vehicletype', vehicletype_strategies)

        # --- MODEL ---
        model_strategies = [
            ['brand', 'vehicletype', 'power', 'year_bin'],
            ['brand', 'vehicletype', 'power', 'registrationyear'],
            ['brand', 'vehicletype', 'power', 'gearbox'],
            ['brand', 'vehicletype', 'year_bin'],
            ['brand', 'vehicletype', 'registrationyear'],
            ['brand', 'vehicletype', 'power'],
            ['brand', 'vehicletype', 'gearbox'],
            ['brand', 'vehicletype'],
            ['brand', 'power'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('model', model_strategies)

        # --- FUELTYPE ---
        fueltype_strategies = [
            ['brand', 'model', 'vehicletype', 'power', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'power', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'power', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'power', 'year_bin'],
            ['brand', 'model', 'power', 'registrationyear'],
            ['brand', 'model', 'power', 'gearbox'],
            ['brand', 'model', 'power'],
            ['brand', 'model'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('fueltype', fueltype_strategies)

        if verbose:
            print(f"🔁 Iteration {iteration} filled {total_filled:,} total values")
        if not repeat_until_change or total_filled == 0:
            if verbose:
                print("🏁 No further changes detected, stopping.")
            break

    return df
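The function above builds a tuple key for every row with `DataFrame.apply(..., axis=1)`, which is probably why single strategies take 10 to 30 seconds in the logs. A minimal sketch of a vectorized alternative follows, under stated assumptions: `confident_group_mode` and the `toy` frame are hypothetical names, not part of the notebook, and the sketch replaces the per-row tuple mapping with a `groupby` plus a left `merge`. It also converts `power == 0` to `NaN` before computing the mode. As far as the code above shows, `safe_mode` only drops `NaN`, so zeros in `power` are counted as ordinary values and can dilute or even win the majority vote.

```python
import numpy as np
import pandas as pd


def confident_group_mode(df, target, keys, threshold=0.7):
    """Per-row fill candidates: the group's most frequent value of `target`
    if its share among known values is >= threshold, else NaN."""
    known = df.dropna(subset=[target])
    # Count every (keys..., value) combination among rows with a known target.
    counts = (
        known.groupby(keys + [target], dropna=False)
        .size()
        .rename("n")
        .reset_index()
    )
    # Share of each value inside its key group.
    totals = counts.groupby(keys, dropna=False)["n"].transform("sum")
    counts["share"] = counts["n"] / totals
    # Keep the most frequent value per group, then apply the threshold.
    # With threshold > 0.5 a tie for first place cannot pass the filter.
    top = counts.sort_values("n", ascending=False).drop_duplicates(subset=keys)
    top = top[top["share"] >= threshold]
    # A left merge keeps the row order of df; `top` has unique keys.
    merged = df[keys].merge(top[keys + [target]], on=keys, how="left")
    return pd.Series(merged[target].to_numpy(), index=df.index)


# Toy data (hypothetical), mirroring the notebook's column names.
toy = pd.DataFrame({
    "brand": ["vw", "vw", "vw", "vw", "bmw", "bmw", "bmw"],
    "model": ["golf"] * 4 + ["x3"] * 3,
    "fueltype": ["petrol", "petrol", "petrol", np.nan, "petrol", "diesel", np.nan],
    "power": [75, 75, 75, 0, 150, 190, 0],
})

# Categorical column: NaN marks a missing value.
fuel_candidates = confident_group_mode(toy, "fueltype", ["brand", "model"], threshold=0.75)
filled_fuel = toy["fueltype"].fillna(fuel_candidates)

# Numeric column: treat 0 as missing first so zeros do not take part in the vote.
toy_power = toy.assign(power=toy["power"].replace(0, np.nan))
power_candidates = confident_group_mode(toy_power, "power", ["brand", "model"], threshold=0.75)
filled_power = toy_power["power"].fillna(power_candidates)

print(filled_fuel.tolist())
print(filled_power.tolist())
```

In the toy data the `vw golf` group is unanimous, so its gaps are filled, while the `bmw x3` group is split 50/50 and stays missing at a 0.75 threshold. Whether this reproduces the notebook's fill counts exactly has not been checked; it is meant as a starting point for speeding up the loop.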
df_app3g = df_app3g[df_app3g['price'] > 99].copy()
gc.collect()
0
df_app3g = fill_gearbox(df_app3g, threshold = 0.75)
Filled 1218 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥75% majority rule)
Filled 272 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype'] (≥75% majority rule)
Filled 428 missing gearbox values using ['brand', 'model', 'fueltype'] (≥75% majority rule)
Filled 391 missing gearbox values using ['brand', 'model', 'vehicletype'] (≥75% majority rule)
Filled 860 missing gearbox values using ['brand', 'model'] (≥75% majority rule)
Filled 638 missing gearbox values using ['brand'] (≥75% majority rule)
✅ Gearbox filling complete: 3807 filled, 1317 still missing.
gc.collect()
df_app = fill_all_missing_values(df_app3g, threshold = 0.75)
🌀 Iteration 1 starting...
→ Starting with 33,126 missing values in 'power'
✅ Filled 145 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (32,981 remaining, took 5.30s)
✅ Filled 82 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (32,899 remaining, took 11.67s)
✅ Filled 119 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (32,780 remaining, took 4.79s)
✅ Filled 67 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (32,713 remaining, took 3.70s)
✅ Filled 47 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (32,666 remaining, took 7.99s)
✅ Filled 194 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (32,472 remaining, took 3.39s)
✅ Filled 145 values in 'power' using ['brand', 'model', 'vehicletype'] (32,327 remaining, took 2.95s)
✅ Filled 11 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (32,316 remaining, took 4.05s)
✅ Filled 31 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (32,285 remaining, took 3.58s)
✅ Filled 119 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (32,166 remaining, took 7.59s)
✅ Filled 34 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (32,132 remaining, took 3.35s)
✅ Filled 171 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'] (31,961 remaining, took 13.71s)
✅ Filled 402 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'] (31,559 remaining, took 31.98s)
✅ Filled 73 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (31,486 remaining, took 12.00s)
✅ Filled 9 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'] (31,477 remaining, took 9.60s)
✅ Filled 143 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (31,334 remaining, took 5.84s)
✅ Filled 19 values in 'power' using ['brand', 'model', 'year_bin'] (31,315 remaining, took 2.76s)
✅ Filled 20 values in 'power' using ['brand', 'model', 'registrationyear'] (31,295 remaining, took 4.97s)
✅ Filled 22 values in 'power' using ['brand', 'model'] (31,273 remaining, took 2.33s)
✅ Filled 19 values in 'power' using ['brand', 'vehicletype'] (31,254 remaining, took 2.43s)
✅ Filled 33 values in 'power' using ['brand', 'year_bin'] (31,221 remaining, took 2.26s)
✅ Filled 1 values in 'power' using ['brand', 'registrationyear'] (31,220 remaining, took 3.35s)
→ Starting with 29,706 missing values in 'vehicletype'
✅ Filled 10,490 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (19,216 remaining, took 10.23s)
✅ Filled 2,507 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (16,709 remaining, took 20.24s)
✅ Filled 3,187 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (13,522 remaining, took 9.26s)
✅ Filled 1,491 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (12,031 remaining, took 2.72s)
✅ Filled 1,362 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (10,669 remaining, took 5.28s)
✅ Filled 201 values in 'vehicletype' using ['brand', 'model', 'power'] (10,468 remaining, took 7.91s)
✅ Filled 213 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (10,255 remaining, took 2.77s)
✅ Filled 102 values in 'vehicletype' using ['brand', 'model'] (10,153 remaining, took 2.47s)
✅ Filled 21 values in 'vehicletype' using ['brand', 'year_bin'] (10,132 remaining, took 2.34s)
✅ Filled 104 values in 'vehicletype' using ['brand', 'registrationyear'] (10,028 remaining, took 3.45s)
✅ Filled 792 values in 'vehicletype' using ['brand', 'power'] (9,236 remaining, took 4.78s)
→ Starting with 7,542 missing values in 'model'
✅ Filled 2,557 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (4,985 remaining, took 10.86s)
✅ Filled 1,174 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (3,811 remaining, took 22.39s)
✅ Filled 707 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (3,104 remaining, took 9.39s)
✅ Filled 475 values in 'model' using ['brand', 'vehicletype', 'year_bin'] (2,629 remaining, took 2.78s)
✅ Filled 330 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (2,299 remaining, took 5.55s)
✅ Filled 58 values in 'model' using ['brand', 'vehicletype', 'power'] (2,241 remaining, took 7.87s)
✅ Filled 57 values in 'model' using ['brand', 'vehicletype', 'gearbox'] (2,184 remaining, took 2.76s)
✅ Filled 103 values in 'model' using ['brand', 'power'] (2,081 remaining, took 4.71s)
✅ Filled 4 values in 'model' using ['brand', 'year_bin'] (2,077 remaining, took 2.31s)
→ Starting with 12,755 missing values in 'fueltype'
✅ Filled 7,086 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'year_bin'] (5,669 remaining, took 13.69s)
✅ Filled 739 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'registrationyear'] (4,930 remaining, took 26.42s)
✅ Filled 1,779 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'gearbox'] (3,151 remaining, took 12.61s)
✅ Filled 1,258 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'year_bin'] (1,893 remaining, took 3.72s)
✅ Filled 246 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'registrationyear'] (1,647 remaining, took 8.20s)
✅ Filled 181 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'gearbox'] (1,466 remaining, took 3.40s)
✅ Filled 173 values in 'fueltype' using ['brand', 'model', 'power', 'year_bin'] (1,293 remaining, took 10.56s)
✅ Filled 182 values in 'fueltype' using ['brand', 'model', 'power', 'registrationyear'] (1,111 remaining, took 20.98s)
✅ Filled 119 values in 'fueltype' using ['brand', 'model', 'power', 'gearbox'] (992 remaining, took 9.22s)
✅ Filled 37 values in 'fueltype' using ['brand', 'model', 'power'] (955 remaining, took 7.63s)
✅ Filled 417 values in 'fueltype' using ['brand', 'model'] (538 remaining, took 2.56s)
✅ Filled 16 values in 'fueltype' using ['brand', 'year_bin'] (522 remaining, took 2.29s)
✅ Filled 41 values in 'fueltype' using ['brand', 'registrationyear'] (481 remaining, took 3.58s)
✅ Filled 12 values in 'fueltype' using ['brand', 'gearbox'] (469 remaining, took 2.35s)
🔁 Iteration 1 filled 40,127 total values

🌀 Iteration 2 starting...
→ Starting with 31,220 missing values in 'power'
✅ Filled 19 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (31,201 remaining, took 4.40s)
✅ Filled 32 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (31,169 remaining, took 10.35s)
✅ Filled 5 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (31,164 remaining, took 4.10s)
✅ Filled 51 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (31,113 remaining, took 3.51s)
✅ Filled 9 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (31,104 remaining, took 7.61s)
✅ Filled 55 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (31,049 remaining, took 3.23s)
✅ Filled 94 values in 'power' using ['brand', 'model', 'vehicletype'] (30,955 remaining, took 2.84s)
✅ Filled 10 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (30,945 remaining, took 3.52s)
✅ Filled 41 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (30,904 remaining, took 3.32s)
✅ Filled 5 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (30,899 remaining, took 6.84s)
✅ Filled 214 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (30,685 remaining, took 3.16s)
✅ Filled 48 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'] (30,637 remaining, took 11.60s)
✅ Filled 63 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'] (30,574 remaining, took 29.69s)
✅ Filled 28 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (30,546 remaining, took 9.93s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'] (30,543 remaining, took 7.62s)
✅ Filled 159 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (30,384 remaining, took 5.35s)
✅ Filled 195 values in 'power' using ['brand', 'model', 'year_bin'] (30,189 remaining, took 2.63s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'registrationyear'] (30,188 remaining, took 4.91s)
✅ Filled 174 values in 'power' using ['brand', 'model'] (30,014 remaining, took 2.38s)
✅ Filled 1 values in 'power' using ['brand', 'vehicletype'] (30,013 remaining, took 2.35s)
→ Starting with 9,236 missing values in 'vehicletype'
✅ Filled 362 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (8,874 remaining, took 10.45s)
✅ Filled 387 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (8,487 remaining, took 20.92s)
✅ Filled 127 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (8,360 remaining, took 9.14s)
✅ Filled 1,726 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (6,634 remaining, took 2.80s)
✅ Filled 91 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (6,543 remaining, took 5.33s)
✅ Filled 10 values in 'vehicletype' using ['brand', 'model', 'power'] (6,533 remaining, took 7.72s)
✅ Filled 2,456 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (4,077 remaining, took 2.75s)
✅ Filled 574 values in 'vehicletype' using ['brand', 'model'] (3,503 remaining, took 2.42s)
→ Starting with 2,077 missing values in 'model'
✅ Filled 13 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (2,064 remaining, took 10.89s)
✅ Filled 410 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (1,654 remaining, took 22.37s)
✅ Filled 2 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (1,652 remaining, took 9.36s)
✅ Filled 8 values in 'model' using ['brand', 'vehicletype', 'year_bin'] (1,644 remaining, took 2.78s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'power'] (1,643 remaining, took 7.83s)
✅ Filled 1 values in 'model' using ['brand', 'power'] (1,642 remaining, took 4.75s)
→ Starting with 469 missing values in 'fueltype'
✅ Filled 9 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'year_bin'] (460 remaining, took 14.02s)
✅ Filled 3 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'gearbox'] (457 remaining, took 12.62s)
✅ Filled 29 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'year_bin'] (428 remaining, took 3.65s)
✅ Filled 6 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'registrationyear'] (422 remaining, took 8.18s)
✅ Filled 1 values in 'fueltype' using ['brand', 'model', 'power', 'gearbox'] (421 remaining, took 9.13s)
✅ Filled 1 values in 'fueltype' using ['brand', 'model', 'power'] (420 remaining, took 7.71s)
✅ Filled 38 values in 'fueltype' using ['brand', 'model'] (382 remaining, took 2.47s)
🔁 Iteration 2 filled 7,462 total values

🌀 Iteration 3 starting...
→ Starting with 30,013 missing values in 'power'
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (30,012 remaining, took 4.44s)
✅ Filled 5 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (30,007 remaining, took 10.21s)
✅ Filled 77 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (29,930 remaining, took 3.52s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (29,929 remaining, took 7.50s)
✅ Filled 407 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (29,522 remaining, took 3.23s)
✅ Filled 81 values in 'power' using ['brand', 'model', 'vehicletype'] (29,441 remaining, took 2.80s)
✅ Filled 64 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (29,377 remaining, took 3.40s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'] (29,375 remaining, took 11.38s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'] (29,372 remaining, took 29.55s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (29,371 remaining, took 9.77s)
✅ Filled 21 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (29,350 remaining, took 5.34s)
✅ Filled 63 values in 'power' using ['brand', 'model', 'year_bin'] (29,287 remaining, took 2.67s)
✅ Filled 158 values in 'power' using ['brand', 'model'] (29,129 remaining, took 2.32s)
→ Starting with 3,503 missing values in 'vehicletype'
✅ Filled 86 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (3,417 remaining, took 10.50s)
✅ Filled 158 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (3,259 remaining, took 20.85s)
✅ Filled 10 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (3,249 remaining, took 2.78s)
✅ Filled 21 values in 'vehicletype' using ['brand', 'model', 'power'] (3,228 remaining, took 7.61s)
→ Starting with 1,642 missing values in 'model'
✅ Filled 2 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (1,640 remaining, took 10.93s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (1,639 remaining, took 22.24s)
→ Starting with 382 missing values in 'fueltype'
✅ Filled 1 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'year_bin'] (381 remaining, took 14.10s)
🔁 Iteration 3 filled 1,163 total values

🌀 Iteration 4 starting...
→ Starting with 29,129 missing values in 'power'
✅ Filled 5 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (29,124 remaining, took 10.08s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (29,122 remaining, took 4.09s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (29,120 remaining, took 3.51s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'] (29,119 remaining, took 29.35s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (29,118 remaining, took 9.85s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (29,117 remaining, took 5.28s)
→ Starting with 3,228 missing values in 'vehicletype'
→ Starting with 1,639 missing values in 'model'
→ Starting with 381 missing values in 'fueltype'
🔁 Iteration 4 filled 12 total values

🌀 Iteration 5 starting...
→ Starting with 29,117 missing values in 'power'
→ Starting with 3,228 missing values in 'vehicletype'
→ Starting with 1,639 missing values in 'model'
→ Starting with 381 missing values in 'fueltype'
🔁 Iteration 5 filled 0 total values
🏁 No further changes detected, stopping.
display(df_app[(df_app['brand'] == 'citroen') & (df_app['model'] == 'c4') & (df_app['vehicletype'] == 'bus')])
mask = (df_app['brand'] == 'citroen') & (df_app['model'] == 'c4') & (df_app['vehicletype'] == 'bus')
df_app.loc[mask,['vehicletype']] = 'small'
display(df_app[(df_app['brand'] == 'renault') & (df_app['model'] == 'megane') & (df_app['vehicletype'] == 'bus')])
mask = (df_app['brand'] == 'renault') & (df_app['model'] == 'megane') & (df_app['vehicletype'] == 'bus')
df_app.loc[mask,['vehicletype']] = 'small'
display(df_app[(df_app['brand'] == 'seat') & (df_app['model'] == 'leon') & (df_app['vehicletype'] == 'bus')])
mask = (df_app['brand'] == 'seat') & (df_app['model'] == 'leon') & (df_app['vehicletype'] == 'bus')
df_app.loc[mask,['vehicletype']] = 'small'
del mask, df_app3g
gc.collect()
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 127610 | 23/03/2016 20:47 | 4299 | bus | 2007.0 | manual | 139.0 | c4 | 150000 | 1 | petrol | citroen | NaN | 2016-03-23 | 0 | 81249 | 06/04/2016 02:16 | N | 2000s | 8 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 211628 | 07/03/2016 17:47 | 15000 | bus | 2016.0 | manual | 190.0 | megane | 50000 | 12 | petrol | renault | no | 2016-07-03 | 0 | 86875 | 06/04/2016 10:44 | N | 2010_plus | 8 |
| 253296 | 28/03/2016 14:36 | 900 | bus | 1998.0 | auto | 117.0 | megane | 125000 | 4 | petrol | renault | no | 2016-03-28 | 0 | 42857 | 02/04/2016 18:45 | N | 1990s | 4 |
| 332972 | 12/03/2016 13:46 | 3650 | bus | 2002.0 | manual | 96.0 | megane | 100000 | 6 | gasoline | renault | no | 2016-12-03 | 0 | 58455 | 12/03/2016 13:46 | N | 2000s | 5 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18876 | 02/04/2016 23:54 | 5000 | bus | 2006.0 | manual | 139.0 | leon | 100000 | 2 | petrol | seat | no | 2016-02-04 | 0 | 67063 | 07/04/2016 07:16 | N | 2000s | 6 |
0
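The three manual corrections above repeat the same mask-and-assign pattern. A minimal sketch of a table-driven version is shown below. `apply_vehicletype_overrides`, `overrides`, and `toy` are hypothetical names introduced for illustration, and the override values simply mirror the three cells above.

```python
import pandas as pd

# (brand, model, vehicletype to replace, replacement), mirroring the cells above.
overrides = [
    ("citroen", "c4", "bus", "small"),
    ("renault", "megane", "bus", "small"),
    ("seat", "leon", "bus", "small"),
]


def apply_vehicletype_overrides(df, overrides):
    """Return a copy of df with each (brand, model, old) combination relabelled to new."""
    out = df.copy()
    for brand, model, old, new in overrides:
        mask = (
            (out["brand"] == brand)
            & (out["model"] == model)
            & (out["vehicletype"] == old)
        )
        out.loc[mask, "vehicletype"] = new
    return out


# Toy data (hypothetical) to show that only matching rows change.
toy = pd.DataFrame({
    "brand": ["citroen", "renault", "seat", "volkswagen"],
    "model": ["c4", "megane", "leon", "transporter"],
    "vehicletype": ["bus", "bus", "bus", "bus"],
})
fixed = apply_vehicletype_overrides(toy, overrides)
print(fixed["vehicletype"].tolist())  # ['small', 'small', 'small', 'bus']
```

Keeping the overrides in one list makes it easier to review which combinations were relabelled by hand and to add further cases later.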
print(df_app.memory_usage(deep=True).sum() / 1_000_000, "MB")
258.344887 MB
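Much of a footprint like the one reported above typically comes from `object` columns that hold repeated strings. The sketch below shows how converting low-cardinality text columns to the `category` dtype can shrink `memory_usage(deep=True)`. It runs on a hypothetical `toy` frame, not on `df_app`, so the saving on the real data is not known from this example.

```python
import pandas as pd

# Toy frame (hypothetical) with two repetitive string columns.
toy = pd.DataFrame({
    "brand": ["volkswagen", "bmw", "opel"] * 10_000,
    "gearbox": ["manual", "auto", "manual"] * 10_000,
})

before = toy.memory_usage(deep=True).sum()

# Each distinct string is stored once; rows keep small integer codes instead.
compact = toy.astype({"brand": "category", "gearbox": "category"})
after = compact.memory_usage(deep=True).sum()

print(f"{before / 1_000_000:.2f} MB -> {after / 1_000_000:.2f} MB")
```

One trade-off to keep in mind: assigning a value that is not already among a column's categories raises an error, so manual relabelling steps such as the `vehicletype` corrections would need the new label added with `cat.add_categories` first, and `groupby` on categorical keys may need `observed=True` to avoid empty groups.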
df_app1 = fill_gearbox(df_app, threshold = 0.75)
Filled 349 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥75% majority rule)
Filled 27 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype'] (≥75% majority rule)
Filled 1 missing gearbox values using ['brand', 'model', 'fueltype'] (≥75% majority rule)
Filled 24 missing gearbox values using ['brand', 'model', 'vehicletype'] (≥75% majority rule)
Filled 9 missing gearbox values using ['brand', 'model'] (≥75% majority rule)
✅ Gearbox filling complete: 410 filled, 907 still missing.
df_app1 = fill_gearbox(df_app1, threshold = 0.6)
del df_app
gc.collect()
Filled 422 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥60% majority rule)
Filled 165 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype'] (≥60% majority rule)
Filled 37 missing gearbox values using ['brand', 'model', 'fueltype'] (≥60% majority rule)
Filled 21 missing gearbox values using ['brand', 'model', 'vehicletype'] (≥60% majority rule)
Filled 12 missing gearbox values using ['brand', 'model'] (≥60% majority rule)
Filled 15 missing gearbox values using ['brand'] (≥60% majority rule)
✅ Gearbox filling complete: 672 filled, 235 still missing.
0
gc.collect()
df_app2 = fill_all_missing_values(df_app1, threshold = 0.6)
🌀 Iteration 1 starting...
→ Starting with 29,117 missing values in 'power'
✅ Filled 422 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (28,695 remaining, took 4.58s)
✅ Filled 984 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (27,711 remaining, took 10.36s)
✅ Filled 205 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (27,506 remaining, took 4.03s)
✅ Filled 67 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (27,439 remaining, took 3.58s)
✅ Filled 96 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (27,343 remaining, took 7.64s)
✅ Filled 151 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (27,192 remaining, took 3.19s)
✅ Filled 56 values in 'power' using ['brand', 'model', 'vehicletype'] (27,136 remaining, took 2.86s)
✅ Filled 18 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (27,118 remaining, took 3.47s)
✅ Filled 76 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (27,042 remaining, took 3.29s)
✅ Filled 336 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (26,706 remaining, took 6.95s)
✅ Filled 65 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (26,641 remaining, took 3.12s)
✅ Filled 423 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'] (26,218 remaining, took 11.60s)
✅ Filled 954 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'] (25,264 remaining, took 29.71s)
✅ Filled 158 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (25,106 remaining, took 9.71s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'] (25,093 remaining, took 7.58s)
✅ Filled 60 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (25,033 remaining, took 5.30s)
✅ Filled 2,768 values in 'power' using ['brand', 'model', 'year_bin'] (22,265 remaining, took 2.62s)
✅ Filled 146 values in 'power' using ['brand', 'model', 'registrationyear'] (22,119 remaining, took 4.93s)
✅ Filled 3,035 values in 'power' using ['brand', 'model'] (19,084 remaining, took 2.40s)
✅ Filled 2 values in 'power' using ['brand', 'vehicletype'] (19,082 remaining, took 2.44s)
✅ Filled 1 values in 'power' using ['brand', 'year_bin'] (19,081 remaining, took 2.34s)
✅ Filled 8 values in 'power' using ['brand', 'registrationyear'] (19,073 remaining, took 3.25s)
→ Starting with 3,228 missing values in 'vehicletype'
✅ Filled 735 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (2,493 remaining, took 10.57s)
✅ Filled 440 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (2,053 remaining, took 20.92s)
✅ Filled 311 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (1,742 remaining, took 9.16s)
✅ Filled 98 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (1,644 remaining, took 2.80s)
✅ Filled 127 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (1,517 remaining, took 5.42s)
✅ Filled 23 values in 'vehicletype' using ['brand', 'model', 'power'] (1,494 remaining, took 7.69s)
✅ Filled 136 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (1,358 remaining, took 2.72s)
✅ Filled 4 values in 'vehicletype' using ['brand', 'model'] (1,354 remaining, took 2.46s)
✅ Filled 78 values in 'vehicletype' using ['brand', 'year_bin'] (1,276 remaining, took 2.33s)
✅ Filled 53 values in 'vehicletype' using ['brand', 'registrationyear'] (1,223 remaining, took 3.48s)
✅ Filled 28 values in 'vehicletype' using ['brand', 'power'] (1,195 remaining, took 4.78s)
✅ Filled 1 values in 'vehicletype' using ['brand', 'gearbox'] (1,194 remaining, took 2.36s)
→ Starting with 1,639 missing values in 'model'
✅ Filled 300 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (1,339 remaining, took 10.91s)
✅ Filled 529 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (810 remaining, took 22.23s)
✅ Filled 51 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (759 remaining, took 9.21s)
✅ Filled 44 values in 'model' using ['brand', 'vehicletype', 'year_bin'] (715 remaining, took 2.80s)
✅ Filled 221 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (494 remaining, took 5.55s)
✅ Filled 5 values in 'model' using ['brand', 'vehicletype', 'gearbox'] (489 remaining, took 2.70s)
✅ Filled 39 values in 'model' using ['brand', 'power'] (450 remaining, took 4.72s)
✅ Filled 1 values in 'model' using ['brand', 'registrationyear'] (449 remaining, took 3.51s)
✅ Filled 1 values in 'model' using ['brand', 'gearbox'] (448 remaining, took 2.34s)
→ Starting with 381 missing values in 'fueltype'
✅ Filled 85 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'year_bin'] (296 remaining, took 14.29s)
✅ Filled 206 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'registrationyear'] (90 remaining, took 26.25s)
✅ Filled 25 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'gearbox'] (65 remaining, took 12.57s)
✅ Filled 43 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'year_bin'] (22 remaining, took 3.65s)
✅ Filled 12 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'registrationyear'] (10 remaining, took 8.19s)
✅ Filled 5 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'gearbox'] (5 remaining, took 3.27s)
✅ Filled 5 values in 'fueltype' using ['brand', 'model'] (0 remaining, took 2.44s)
🔁 Iteration 1 filled 13,650 total values

🌀 Iteration 2 starting...
→ Starting with 19,073 missing values in 'power'
✅ Filled 157 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (18,916 remaining, took 4.39s)
✅ Filled 104 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (18,812 remaining, took 10.00s)
✅ Filled 29 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (18,783 remaining, took 3.93s)
✅ Filled 10 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (18,773 remaining, took 3.40s)
✅ Filled 12 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (18,761 remaining, took 7.46s)
✅ Filled 24 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (18,737 remaining, took 3.05s)
✅ Filled 155 values in 'power' using ['brand', 'model', 'vehicletype'] (18,582 remaining, took 2.73s)
✅ Filled 56 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (18,526 remaining, took 3.26s)
✅ Filled 51 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (18,475 remaining, took 6.78s)
✅ Filled 328 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (18,147 remaining, took 3.03s)
✅ Filled 40 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'] (18,107 remaining, took 11.15s)
✅ Filled 41 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'] (18,066 remaining, took 29.03s)
✅ Filled 17 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (18,049 remaining, took 9.29s)
✅ Filled 4 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'] (18,045 remaining, took 7.25s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (18,044 remaining, took 4.98s)
✅ Filled 840 values in 'power' using ['brand', 'model', 'year_bin'] (17,204 remaining, took 2.63s)
✅ Filled 1,085 values in 'power' using ['brand', 'model'] (16,119 remaining, took 2.29s)
→ Starting with 1,194 missing values in 'vehicletype'
✅ Filled 4 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (1,190 remaining, took 20.76s)
✅ Filled 38 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (1,152 remaining, took 9.07s)
✅ Filled 350 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (802 remaining, took 2.68s)
✅ Filled 513 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (289 remaining, took 2.67s)
✅ Filled 80 values in 'vehicletype' using ['brand', 'model'] (209 remaining, took 2.41s)
→ Starting with 448 missing values in 'model'
✅ Filled 46 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (402 remaining, took 10.89s)
✅ Filled 132 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (270 remaining, took 22.02s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (269 remaining, took 9.21s)
🔁 Iteration 2 filled 4,118 total values

🌀 Iteration 3 starting...
→ Starting with 16,119 missing values in 'power'
✅ Filled 5 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (16,114 remaining, took 9.95s)
✅ Filled 7 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (16,107 remaining, took 3.41s)
✅ Filled 10 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (16,097 remaining, took 7.56s)
✅ Filled 694 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (15,403 remaining, took 3.11s)
✅ Filled 304 values in 'power' using ['brand', 'model', 'vehicletype'] (15,099 remaining, took 2.72s)
✅ Filled 5 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (15,094 remaining, took 3.27s)
✅ Filled 68 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (15,026 remaining, took 3.09s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'] (15,025 remaining, took 11.12s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'] (15,022 remaining, took 29.05s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (15,020 remaining, took 9.23s)
✅ Filled 149 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (14,871 remaining, took 4.93s)
→ Starting with 209 missing values in 'vehicletype'
✅ Filled 9 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (200 remaining, took 10.56s)
✅ Filled 2 values in 'vehicletype' using ['brand', 'power'] (198 remaining, took 4.73s)
→ Starting with 269 missing values in 'model'
✅ Filled 47 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (222 remaining, took 10.64s)
✅ Filled 8 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (214 remaining, took 21.97s)
✅ Filled 7 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (207 remaining, took 
5.41s) ✅ Filled 4 values in 'model' using ['brand', 'power'] (203 remaining, took 4.62s) 🔁 Iteration 3 filled 1,325 total values 🌀 Iteration 4 starting... → Starting with 14,871 missing values in 'power' ✅ Filled 3 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (14,868 remaining, took 9.96s) ✅ Filled 945 values in 'power' using ['brand', 'model', 'year_bin'] (13,923 remaining, took 2.58s) ✅ Filled 1,582 values in 'power' using ['brand', 'model'] (12,341 remaining, took 2.34s) → Starting with 198 missing values in 'vehicletype' ✅ Filled 3 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (195 remaining, took 5.25s) → Starting with 203 missing values in 'model' ✅ Filled 2 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (201 remaining, took 10.71s) 🔁 Iteration 4 filled 2,535 total values 🌀 Iteration 5 starting... → Starting with 12,341 missing values in 'power' → Starting with 195 missing values in 'vehicletype' → Starting with 201 missing values in 'model' 🔁 Iteration 5 filled 0 total values 🏁 No further changes detected, stopping.
del df_app1
gc.collect()
# Only one Lada Niva is labelled as a bus, which is improbable; relabel it as an SUV.
# To inspect the row: df_app2[(df_app2['brand'] == 'lada') & (df_app2['model'] == 'niva') & (df_app2['vehicletype'] == 'bus')]
mask = (df_app2['brand'] == 'lada') & (df_app2['model'] == 'niva') & (df_app2['vehicletype'] == 'bus')
df_app2.loc[mask, 'vehicletype'] = 'suv'
del mask
gc.collect()
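The Lada Niva fix above was found by hand. A minimal sketch of how such improbable labels could be listed systematically is shown below. The helper name `rare_vehicletype_labels`, its two cut-offs, and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def rare_vehicletype_labels(df, max_count=1, max_share=0.05):
    """List (brand, model, vehicletype) combinations that look like mislabels.

    Hypothetical helper: a combination is flagged when it occurs at most
    `max_count` times and makes up at most `max_share` of its brand/model pair.
    """
    counts = (
        df.groupby(['brand', 'model', 'vehicletype'])
          .size()
          .rename('n')
          .reset_index()
    )
    # Share of each vehicle type within its brand/model pair
    pair_totals = counts.groupby(['brand', 'model'])['n'].transform('sum')
    counts['share'] = counts['n'] / pair_totals
    suspicious = counts[(counts['n'] <= max_count) & (counts['share'] <= max_share)]
    return suspicious.sort_values('share').reset_index(drop=True)

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'brand': ['lada'] * 30,
    'model': ['niva'] * 30,
    'vehicletype': ['suv'] * 29 + ['bus'],
})
result = rare_vehicletype_labels(toy)
print(result)
```

Flagged rows are only candidates; each one still needs a manual check before relabelling.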
print(df_app2.memory_usage(deep=True).sum() / 1_000_000, "MB")
258.523303 MB
df_app2 = fill_gearbox(df_app2, threshold = 0.6)
Filled 107 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥60% majority rule)
✅ Gearbox filling complete: 107 filled, 128 still missing.
df_ap4 = df_app2.drop_duplicates()
df_app7 = df_ap4[df_ap4['vehicletype'].notna()]
del df_app2
gc.collect()
0
df_app8 = df_app7[df_app7['gearbox'].notna()]
del df_ap4
gc.collect()
0
def fill_all_missing_values_mp(df, repeat_until_change=True, threshold=0.7, verbose=True, max_iterations=10):
    """
    Fill missing values for power and model using tiered group strategies.
    Optimized version with better memory management and early stopping.
    """
    df = df.copy()
    def safe_mode(series):
        """Return mode if confident enough (>= threshold), else NaN."""
        s = series.dropna()
        if len(s) == 0:
            return np.nan
        counts = s.value_counts(normalize=True)
        if len(counts) == 0:
            return np.nan
        top_val, top_freq = counts.index[0], counts.iloc[0]
        return top_val if top_freq >= threshold else np.nan
    def is_zero_condition(condition):
        """Heuristic to detect if condition checks for zero (lambda x: x == 0)."""
        try:
            test = condition(pd.Series([0, np.nan], dtype=object))
            if isinstance(test, (bool, np.bool_)) and test:
                return True
            if hasattr(test, "__len__") and len(test) >= 1:
                return bool(test.iloc[0] if hasattr(test, 'iloc') else test[0])
        except Exception:
            pass
        return False
    def make_key_tuple(row_vals):
        """Helper: convert list-like row values to a hashable tuple with None for NaN."""
        return tuple(None if (isinstance(x, float) and np.isnan(x)) else x for x in row_vals)
    def fill_column(target_col, fill_strategies, condition=lambda x: x.isna()):
        total_filled = 0
        zero_check = is_zero_condition(condition)
        # Track initial state
        if zero_check:
            initial_missing = (df[target_col] == 0).sum()
        else:
            initial_missing = df[target_col].isna().sum()
        if initial_missing == 0:
            return 0
        if verbose:
            print(f" → Starting with {initial_missing:,} missing values in '{target_col}'")
        for cols in fill_strategies:
            # Check if there's still work to do
            if zero_check:
                current_missing = (df[target_col] == 0).sum()
            else:
                current_missing = df[target_col].isna().sum()
            if current_missing == 0:
                break
            start_time = time.time()
            try:
                # Compute group modes using safe_mode
                group_modes = (
                    df.groupby(cols, dropna=False)[target_col]
                    .apply(safe_mode)
                    .reset_index()
                    .rename(columns={target_col: 'fill_value'})
                )
                # Remove groups with no valid fill value
                group_modes = group_modes[group_modes['fill_value'].notna()]
                if len(group_modes) == 0:
                    continue
            except Exception as e:
                if verbose:
                    print(f"⚠️ Skipping strategy {cols} — groupby failed: {str(e)[:50]}")
                continue
            # Build mapping dict from group_modes
            keys = group_modes[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            mapping = dict(zip(keys, group_modes['fill_value'].values))
            # Compute fill_value per-row by mapping (keeps original row order)
            row_keys = df[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            fill_series = row_keys.map(mapping)
            # Create mask of rows that need filling AND have a candidate fill_value
            mask_need = condition(df[target_col])
            mask_candidate = fill_series.notna()
            mask = mask_need & mask_candidate
            # Count before
            if zero_check:
                before_missing = (df[target_col] == 0).sum()
            else:
                before_missing = df[target_col].isna().sum()
            # Perform fill
            if mask.any():
                df.loc[mask, target_col] = fill_series.loc[mask].values
            # Count after
            if zero_check:
                after_missing = (df[target_col] == 0).sum()
            else:
                after_missing = df[target_col].isna().sum()
            filled_now = before_missing - after_missing
            total_filled += int(filled_now)
            if verbose and filled_now > 0:
                elapsed = time.time() - start_time
                print(f"✅ Filled {int(filled_now):,} values in '{target_col}' using {cols} ({after_missing:,} remaining, took {elapsed:.2f}s)")
        return total_filled
    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        total_filled = 0
        if verbose:
            print(f"\n🌀 Iteration {iteration} starting...")
        # --- POWER ---
        power_strategies = [
            ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'pc_bin'],
            ['brand', 'model', 'year_bin'],
            ['brand', 'model', 'registrationyear'],
            ['brand', 'model'],
            ['brand', 'vehicletype'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('power', power_strategies, condition=lambda x: x == 0)
        # --- MODEL ---
        model_strategies = [
            ['brand', 'vehicletype', 'power', 'year_bin'],
            ['brand', 'vehicletype', 'power', 'registrationyear'],
            ['brand', 'vehicletype', 'power', 'gearbox'],
            ['brand', 'vehicletype', 'year_bin'],
            ['brand', 'vehicletype', 'registrationyear'],
            ['brand', 'vehicletype', 'power'],
            ['brand', 'vehicletype', 'gearbox'],
            ['brand', 'vehicletype'],
            ['brand', 'power'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('model', model_strategies)
        if verbose:
            print(f"🔁 Iteration {iteration} filled {total_filled:,} total values")
        if not repeat_until_change or total_filled == 0:
            if verbose:
                print("🏁 No further changes detected, stopping.")
            break
    return df
gc.collect()
df_app10 = fill_all_missing_values_mp(df_app8, threshold = 0.55)
🌀 Iteration 1 starting...
→ Starting with 12,185 missing values in 'power'
✅ Filled 134 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (12,051 remaining, took 4.35s)
✅ Filled 390 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (11,661 remaining, took 10.29s)
✅ Filled 178 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (11,483 remaining, took 3.84s)
✅ Filled 9 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (11,474 remaining, took 3.34s)
✅ Filled 72 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (11,402 remaining, took 7.45s)
✅ Filled 111 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (11,291 remaining, took 3.13s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype'] (11,290 remaining, took 2.73s)
✅ Filled 7 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (11,283 remaining, took 3.30s)
✅ Filled 79 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (11,204 remaining, took 3.27s)
✅ Filled 112 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (11,092 remaining, took 6.80s)
✅ Filled 9 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (11,083 remaining, took 3.09s)
✅ Filled 149 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'] (10,934 remaining, took 11.05s)
✅ Filled 205 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'] (10,729 remaining, took 29.02s)
✅ Filled 83 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (10,646 remaining, took 9.25s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'] (10,644 remaining, took 7.21s)
✅ Filled 5 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (10,639 remaining, took 5.06s)
✅ Filled 138 values in 'power' using ['brand', 'model', 'registrationyear'] (10,501 remaining, took 4.92s)
✅ Filled 1 values in 'power' using ['brand', 'year_bin'] (10,500 remaining, took 2.32s)
✅ Filled 4 values in 'power' using ['brand', 'registrationyear'] (10,496 remaining, took 3.27s)
→ Starting with 201 missing values in 'model'
✅ Filled 6 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (195 remaining, took 10.70s)
✅ Filled 52 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (143 remaining, took 22.01s)
✅ Filled 78 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (65 remaining, took 9.10s)
✅ Filled 2 values in 'model' using ['brand', 'vehicletype', 'year_bin'] (63 remaining, took 2.70s)
✅ Filled 9 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (54 remaining, took 5.36s)
✅ Filled 4 values in 'model' using ['brand', 'vehicletype', 'gearbox'] (50 remaining, took 2.56s)
🔁 Iteration 1 filled 1,840 total values
🌀 Iteration 2 starting...
→ Starting with 10,496 missing values in 'power'
✅ Filled 30 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (10,466 remaining, took 4.33s)
✅ Filled 61 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (10,405 remaining, took 10.07s)
✅ Filled 10 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (10,395 remaining, took 3.85s)
✅ Filled 10 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (10,385 remaining, took 3.39s)
✅ Filled 10 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (10,375 remaining, took 7.39s)
✅ Filled 220 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (10,155 remaining, took 3.16s)
✅ Filled 6 values in 'power' using ['brand', 'model', 'vehicletype'] (10,149 remaining, took 2.71s)
✅ Filled 10 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (10,139 remaining, took 3.32s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (10,138 remaining, took 6.84s)
✅ Filled 29 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (10,109 remaining, took 3.04s)
✅ Filled 37 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'] (10,072 remaining, took 11.07s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'] (10,071 remaining, took 28.97s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (10,070 remaining, took 9.26s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (10,069 remaining, took 4.96s)
✅ Filled 589 values in 'power' using ['brand', 'model', 'year_bin'] (9,480 remaining, took 2.68s)
✅ Filled 723 values in 'power' using ['brand', 'model'] (8,757 remaining, took 2.36s)
✅ Filled 5 values in 'power' using ['brand', 'year_bin'] (8,752 remaining, took 2.19s)
→ Starting with 50 missing values in 'model'
✅ Filled 12 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (38 remaining, took 22.08s)
🔁 Iteration 2 filled 1,756 total values
🌀 Iteration 3 starting...
→ Starting with 8,752 missing values in 'power'
✅ Filled 2 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (8,750 remaining, took 9.95s)
→ Starting with 38 missing values in 'model'
🔁 Iteration 3 filled 2 total values
🌀 Iteration 4 starting...
→ Starting with 8,750 missing values in 'power'
→ Starting with 38 missing values in 'model'
🔁 Iteration 4 filled 0 total values
🏁 No further changes detected, stopping.
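The core step inside `fill_column` is: take the most frequent value per group, accept it only if its share reaches `threshold`, then fill the gaps. That step can be illustrated on a toy frame. The sketch below is a simplified stand-in, not the notebook's implementation: it uses a left merge instead of the row-wise tuple mapping, handles only NaN (not the `power == 0` case), and the names `confident_mode` and `fill_by_group_mode` are hypothetical.

```python
import numpy as np
import pandas as pd

def confident_mode(series, threshold):
    """Most frequent non-null value if its share is >= threshold, else NaN."""
    counts = series.dropna().value_counts(normalize=True)
    if counts.empty or counts.iloc[0] < threshold:
        return np.nan
    return counts.index[0]

def fill_by_group_mode(df, target, keys, threshold=0.7):
    """Fill NaN in `target` with the confident per-group mode (hypothetical helper)."""
    modes = (
        df.groupby(keys)[target]
          .apply(lambda s: confident_mode(s, threshold))
          .rename('fill_value')
          .reset_index()
    )
    # A left merge keeps the row order of df; re-attach the original index
    merged = df[keys].merge(modes, on=keys, how='left')
    fill_values = pd.Series(merged['fill_value'].to_numpy(), index=df.index)
    out = df.copy()
    out[target] = out[target].fillna(fill_values)
    return out

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'brand':   ['vw', 'vw', 'vw', 'vw', 'bmw', 'bmw', 'bmw'],
    'gearbox': ['manual', 'manual', 'manual', 'manual', 'auto', 'auto', 'auto'],
    'model':   ['golf', 'golf', 'golf', np.nan, '3er', '5er', np.nan],
})
filled = fill_by_group_mode(toy, 'model', ['brand', 'gearbox'], threshold=0.7)
print(filled)
```

In the toy frame the VW group has a unanimous mode, so its gap is filled. The BMW group is split evenly between two models, so the gap stays empty rather than being guessed.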
df_app12 = df_app10[df_app10['model'].notna()]
gc.collect()
0
def fill_missing_power(df, repeat_until_change=True, threshold=0.7, verbose=True, max_iterations=10):
    """
    Fill missing power values (where power == 0) using tiered group strategies.
    Optimized version with better memory management and early stopping.
    """
    df = df.copy()
    def safe_mode(series):
        """Return mode if confident enough (>= threshold), else NaN."""
        s = series.dropna()
        if len(s) == 0:
            return np.nan
        counts = s.value_counts(normalize=True)
        if len(counts) == 0:
            return np.nan
        top_val, top_freq = counts.index[0], counts.iloc[0]
        return top_val if top_freq >= threshold else np.nan
    def is_zero_condition(condition):
        """Heuristic to detect if condition checks for zero (lambda x: x == 0)."""
        try:
            test = condition(pd.Series([0, np.nan], dtype=object))
            if isinstance(test, (bool, np.bool_)) and test:
                return True
            if hasattr(test, "__len__") and len(test) >= 1:
                return bool(test.iloc[0] if hasattr(test, 'iloc') else test[0])
        except Exception:
            pass
        return False
    def make_key_tuple(row_vals):
        """Helper: convert list-like row values to a hashable tuple with None for NaN."""
        return tuple(None if (isinstance(x, float) and np.isnan(x)) else x for x in row_vals)
    def fill_column(target_col, fill_strategies, condition=lambda x: x.isna()):
        total_filled = 0
        zero_check = is_zero_condition(condition)
        # Track initial state
        if zero_check:
            initial_missing = (df[target_col] == 0).sum()
        else:
            initial_missing = df[target_col].isna().sum()
        if initial_missing == 0:
            return 0
        if verbose:
            print(f" → Starting with {initial_missing:,} missing values in '{target_col}'")
        for cols in fill_strategies:
            # Check if there's still work to do
            if zero_check:
                current_missing = (df[target_col] == 0).sum()
            else:
                current_missing = df[target_col].isna().sum()
            if current_missing == 0:
                break
            start_time = time.time()
            try:
                # Compute group modes using safe_mode
                group_modes = (
                    df.groupby(cols, dropna=False)[target_col]
                    .apply(safe_mode)
                    .reset_index()
                    .rename(columns={target_col: 'fill_value'})
                )
                # Remove groups with no valid fill value
                group_modes = group_modes[group_modes['fill_value'].notna()]
                if len(group_modes) == 0:
                    continue
            except Exception as e:
                if verbose:
                    print(f"⚠️ Skipping strategy {cols} — groupby failed: {str(e)[:50]}")
                continue
            # Build mapping dict from group_modes
            keys = group_modes[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            mapping = dict(zip(keys, group_modes['fill_value'].values))
            # Compute fill_value per-row by mapping (keeps original row order)
            row_keys = df[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            fill_series = row_keys.map(mapping)
            # Create mask of rows that need filling AND have a candidate fill_value
            mask_need = condition(df[target_col])
            mask_candidate = fill_series.notna()
            mask = mask_need & mask_candidate
            # Count before
            if zero_check:
                before_missing = (df[target_col] == 0).sum()
            else:
                before_missing = df[target_col].isna().sum()
            # Perform fill
            if mask.any():
                df.loc[mask, target_col] = fill_series.loc[mask].values
            # Count after
            if zero_check:
                after_missing = (df[target_col] == 0).sum()
            else:
                after_missing = df[target_col].isna().sum()
            filled_now = before_missing - after_missing
            total_filled += int(filled_now)
            if verbose and filled_now > 0:
                elapsed = time.time() - start_time
                print(f"✅ Filled {int(filled_now):,} values in '{target_col}' using {cols} ({after_missing:,} remaining, took {elapsed:.2f}s)")
        return total_filled
    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        total_filled = 0
        if verbose:
            print(f"\n🌀 Iteration {iteration} starting...")
        # --- POWER ONLY ---
        power_strategies = [
            ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'pc_bin'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'gearbox'],
            ['brand', 'model', 'year_bin'],
            ['brand', 'model', 'registrationyear'],
            ['brand', 'model', 'pc_bin'],
            ['brand', 'model'],
            ['brand', 'vehicletype'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand', 'pc_bin'],
            ['brand']
        ]
        total_filled += fill_column('power', power_strategies, condition=lambda x: x == 0)
        if verbose:
            print(f"🔁 Iteration {iteration} filled {total_filled:,} total values")
        if not repeat_until_change or total_filled == 0:
            if verbose:
                print("🏁 No further changes detected, stopping.")
            break
    return df
gc.collect()
df_app13 = fill_missing_power(df_app12, threshold = 0.51)
🌀 Iteration 1 starting...
→ Starting with 8,745 missing values in 'power'
✅ Filled 86 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'] (8,659 remaining, took 11.04s)
✅ Filled 154 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'] (8,505 remaining, took 28.86s)
✅ Filled 40 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (8,465 remaining, took 9.25s)
✅ Filled 6 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'] (8,459 remaining, took 7.17s)
✅ Filled 11 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (8,448 remaining, took 4.88s)
✅ Filled 77 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (8,371 remaining, took 4.30s)
✅ Filled 235 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (8,136 remaining, took 9.97s)
✅ Filled 47 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (8,089 remaining, took 3.80s)
✅ Filled 12 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (8,077 remaining, took 3.29s)
✅ Filled 77 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (8,000 remaining, took 7.45s)
✅ Filled 7 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (7,993 remaining, took 3.01s)
✅ Filled 12 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (7,981 remaining, took 3.20s)
✅ Filled 150 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (7,831 remaining, took 6.76s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (7,830 remaining, took 3.01s)
✅ Filled 51 values in 'power' using ['brand', 'model', 'registrationyear'] (7,779 remaining, took 4.94s)
✅ Filled 22 values in 'power' using ['brand', 'model', 'pc_bin'] (7,757 remaining, took 3.27s)
✅ Filled 3 values in 'power' using ['brand', 'vehicletype'] (7,754 remaining, took 2.27s)
✅ Filled 9 values in 'power' using ['brand', 'registrationyear'] (7,745 remaining, took 3.21s)
✅ Filled 1 values in 'power' using ['brand', 'pc_bin'] (7,744 remaining, took 2.33s)
🔁 Iteration 1 filled 1,001 total values
🌀 Iteration 2 starting...
→ Starting with 7,744 missing values in 'power'
✅ Filled 6 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'] (7,738 remaining, took 11.06s)
✅ Filled 8 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (7,730 remaining, took 9.13s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (7,729 remaining, took 4.99s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (7,726 remaining, took 3.84s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (7,725 remaining, took 6.78s)
🔁 Iteration 2 filled 19 total values
🌀 Iteration 3 starting...
→ Starting with 7,725 missing values in 'power'
🔁 Iteration 3 filled 0 total values
🏁 No further changes detected, stopping.
del df_app7,df_app8,df_app10,df_app12
gc.collect()
print(df_app13.memory_usage(deep=True).sum() / 1_000_000, "MB")
258.253213 MB
df_app14 = df_app13[df_app13['power'] != 0]
df_app15 = df_app14.drop(columns = ['year_bin','pc_bin', 'registration_correction'])
df_app15['notrepaired'] = df_app15['notrepaired'].fillna('unknown')
df_app15.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 329992 entries, 0 to 338096
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   datecrawled        329992 non-null  object
 1   price              329992 non-null  int64
 2   vehicletype        329992 non-null  object
 3   registrationyear   329992 non-null  float64
 4   gearbox            329992 non-null  object
 5   power              329992 non-null  float64
 6   model              329992 non-null  object
 7   mileage            329992 non-null  int64
 8   registrationmonth  329992 non-null  int64
 9   fueltype           329992 non-null  object
 10  brand              329992 non-null  object
 11  notrepaired        329992 non-null  object
 12  datecreated        329992 non-null  datetime64[ns]
 13  numberofpictures   329992 non-null  int64
 14  postalcode         329992 non-null  int64
 15  lastseen           329992 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(5), object(8)
memory usage: 42.8+ MB
df_app15.to_pickle('checkpoint_02.pkl')
# Merge the 'gasoline' label into 'petrol'
df_app15.loc[df_app15['fueltype'] == 'gasoline', 'fueltype'] = 'petrol'
# Default any remaining missing fuel type to 'petrol'
df_app15['fueltype'] = df_app15['fueltype'].fillna('petrol')
DType Clean Up¶
# 1. Fix datetime columns
date_cols = ['datecrawled', 'lastseen']
for col in date_cols:
    df_app15[col] = pd.to_datetime(df_app15[col], errors='coerce')
# 2. Convert numeric columns to efficient types
# registrationyear & power should not be floats
df_app15['registrationyear'] = df_app15['registrationyear'].astype('int')
df_app15['power'] = df_app15['power'].astype('int')
# 3. Clean up memory
gc.collect()
print("Final memory usage:", df_app15.memory_usage(deep=True).sum() / 1_000_000, "MB")
print(df_app15.dtypes)
Final memory usage: 152.2319 MB
datecrawled          datetime64[ns]
price                         int64
vehicletype                  object
registrationyear              int64
gearbox                      object
power                         int64
model                        object
mileage                       int64
registrationmonth             int64
fueltype                     object
brand                        object
notrepaired                  object
datecreated          datetime64[ns]
numberofpictures              int64
postalcode                    int64
lastseen             datetime64[ns]
dtype: object
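The cleanup above casts `registrationyear` and `power` with `astype('int')`, which the output shows as `int64`. If memory is a concern, a further reduction is possible by downcasting integers and storing repeated strings as categories. The sketch below is one possible approach under stated assumptions: `shrink_dtypes` and its `max_unique_ratio` cut-off are hypothetical, the toy data is made up, and categorical columns may need extra care with the encoders used later.

```python
import pandas as pd

def shrink_dtypes(df, max_unique_ratio=0.5):
    """Return a copy with smaller integer types and categorical string columns.

    Hypothetical helper: string columns are converted only when the ratio of
    distinct values to rows is at most `max_unique_ratio`.
    """
    out = df.copy()
    n_rows = max(len(out), 1)
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Picks the smallest signed integer type that holds every value
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_string_dtype(out[col]):
            if out[col].nunique(dropna=False) / n_rows <= max_unique_ratio:
                out[col] = out[col].astype('category')
    return out

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'power': [75, 101, 150, 90] * 250,
    'registrationyear': [1999, 2005, 2011, 2003] * 250,
    'gearbox': ['manual', 'auto', 'manual', 'manual'] * 250,
})
small = shrink_dtypes(toy)
print(small.dtypes)
print(toy.memory_usage(deep=True).sum(), '->', small.memory_usage(deep=True).sum())
```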
df_app15['datecrawled_year'] = df_app15['datecrawled'].dt.year
df_app15['datecrawled_month'] = df_app15['datecrawled'].dt.month.astype('object')
df_app15['datecreated_year'] = df_app15['datecreated'].dt.year
df_app15['datecreated_month'] = df_app15['datecreated'].dt.month.astype('object')
df_app15['lastseen_year'] = df_app15['lastseen'].dt.year
df_app15['lastseen_month'] = df_app15['lastseen'].dt.month.astype('object')
df_app15['postalcode'] = df_app15['postalcode'].astype('object')
df_app15['registrationmonth'] = df_app15['registrationmonth'].astype('object')
df_app15.insert(df_app15.columns.get_loc("datecrawled"), "datecrawled_month", df_app15.pop("datecrawled_month"))
df_app15.insert(df_app15.columns.get_loc("datecrawled") + 1, "datecrawled_year", df_app15.pop("datecrawled_year"))
df_app15.insert(df_app15.columns.get_loc("datecreated"), "datecreated_month", df_app15.pop("datecreated_month"))
df_app15.insert(df_app15.columns.get_loc("datecreated") + 1, "datecreated_year", df_app15.pop("datecreated_year"))
df_app15.insert(df_app15.columns.get_loc("lastseen"), "lastseen_month", df_app15.pop("lastseen_month"))
df_app15.insert(df_app15.columns.get_loc("lastseen") + 1, "lastseen_year", df_app15.pop("lastseen_year"))
df_app15 = df_app15.drop(columns=['datecrawled', 'datecreated', 'lastseen'])
df_app15.info()
gc.collect()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 329992 entries, 0 to 338096
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   datecrawled_month  329992 non-null  object
 1   datecrawled_year   329992 non-null  int64
 2   price              329992 non-null  int64
 3   vehicletype        329992 non-null  object
 4   registrationyear   329992 non-null  int64
 5   gearbox            329992 non-null  object
 6   power              329992 non-null  int64
 7   model              329992 non-null  object
 8   mileage            329992 non-null  int64
 9   registrationmonth  329992 non-null  object
 10  fueltype           329992 non-null  object
 11  brand              329992 non-null  object
 12  notrepaired        329992 non-null  object
 13  datecreated_month  329992 non-null  object
 14  datecreated_year   329992 non-null  int64
 15  numberofpictures   329992 non-null  int64
 16  postalcode         329992 non-null  object
 17  lastseen_month     329992 non-null  object
 18  lastseen_year      329992 non-null  int64
dtypes: int64(8), object(11)
memory usage: 50.4+ MB
0
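The date handling above repeats the same three steps for each timestamp column: extract month and year, move them to the position of the source column, and drop the source. A compact loop version might look like the sketch below. The helper name `expand_date_parts` is an assumption and the toy rows are made up.

```python
import pandas as pd

def expand_date_parts(df, date_cols):
    """Replace each date column with <col>_month and <col>_year at the same position.

    Hypothetical helper mirroring the column order produced above
    (month first, then year).
    """
    out = df.copy()
    for col in date_cols:
        pos = out.columns.get_loc(col)
        dates = pd.to_datetime(out.pop(col), errors='coerce')
        # Month is kept as object so it can be treated as categorical
        out.insert(pos, f'{col}_month', dates.dt.month.astype('object'))
        out.insert(pos + 1, f'{col}_year', dates.dt.year)
    return out

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'datecrawled': ['2016-03-24 11:52:17', '2016-04-01 10:00:00'],
    'price': [480, 18300],
    'lastseen': ['2016-04-07 03:16:57', '2016-04-07 01:46:50'],
})
expanded = expand_date_parts(toy, ['datecrawled', 'lastseen'])
print(expanded)
```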
DataFrame Comparison¶
coupe = df1[df1['VehicleType'] == 'coupe']
suv = df1[df1['VehicleType'] == 'suv']
small = df1[df1['VehicleType'] == 'small']
sedan = df1[df1['VehicleType'] == 'sedan']
convertible = df1[df1['VehicleType'] == 'convertible']
bus = df1[df1['VehicleType'] == 'bus']
wagon = df1[df1['VehicleType'] == 'wagon']
ncoupe = df_app15[df_app15['vehicletype'] == 'coupe']
nsuv = df_app15[df_app15['vehicletype'] == 'suv']
nsmall = df_app15[df_app15['vehicletype'] == 'small']
nsedan = df_app15[df_app15['vehicletype'] == 'sedan']
nconvertible = df_app15[df_app15['vehicletype'] == 'convertible']
nbus = df_app15[df_app15['vehicletype'] == 'bus']
nwagon = df_app15[df_app15['vehicletype'] == 'wagon']
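The subsets above feed visual before/after comparisons. A numeric summary can complement the plots by putting row counts and median prices per vehicle type side by side. The sketch below is one way to do this; `compare_by_type` is a hypothetical helper, and the toy frames only imitate the column naming of `df1` (capitalised) and `df_app15` (lowercase).

```python
import pandas as pd

def compare_by_type(before, after,
                    type_col_before='VehicleType', price_col_before='Price',
                    type_col_after='vehicletype', price_col_after='price'):
    """Count and median price per vehicle type, before vs after cleaning."""
    def summarize(df, type_col, price_col, suffix):
        return (
            df.groupby(type_col)[price_col]
              .agg(['count', 'median'])
              .add_suffix(suffix)
              .rename_axis('vehicletype')
        )
    left = summarize(before, type_col_before, price_col_before, '_before')
    right = summarize(after, type_col_after, price_col_after, '_after')
    # Outer join keeps types that appear in only one of the two frames
    return left.join(right, how='outer')

# Toy data, made up for illustration only
before = pd.DataFrame({
    'VehicleType': ['coupe', 'coupe', 'suv', None],
    'Price': [1000, 0, 5000, 300],
})
after = pd.DataFrame({
    'vehicletype': ['coupe', 'suv', 'suv'],
    'price': [1000, 5000, 7000],
})
summary = compare_by_type(before, after)
print(summary)
```

With the real data, the call would be `compare_by_type(df1, df_app15)`, assuming both frames are still in memory at this point.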
coupe['Brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Coupes per Brand: Before Data Cleaning')
plt.show()
ncoupe['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Coupes per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=coupe, x='Price', y='Brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Coupe Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=ncoupe, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Coupe Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
del coupe
del ncoupe
suv['Brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of SUVs per Brand: Before Data Cleaning')
plt.show()
nsuv['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of SUVs per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=suv, x='Price', y='Brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of SUV Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=nsuv, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of SUV Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
del suv
del nsuv
small['Brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Small Cars per Brand: Before Data Cleaning')
plt.show()
nsmall['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Small Cars per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=small, x='Price', y='Brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Small Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=nsmall, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Small Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
del small
del nsmall
sedan['Brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Sedans per Brand: Before Data Cleaning')
plt.show()
nsedan['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Sedan per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=sedan, x='Price', y='Brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Sedan Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=nsedan, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Sedan Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
del sedan
del nsedan
convertible['Brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Convertibles per Brand: Before Data Cleaning')
plt.show()
nconvertible['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Convertibles per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=convertible, x='Price', y='Brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Convertible Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=nconvertible, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Convertible Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
del convertible
del nconvertible
bus['Brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Buses per Brand: Before Data Cleaning')
plt.show()
nbus['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Buses per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=bus, x='Price', y='Brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Bus Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=nbus, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Bus Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
del bus
del nbus
wagon['Brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Wagons per Brand: Before Data Cleaning')
plt.show()
nwagon['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Wagons per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=wagon, x='Price', y='Brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Wagon Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=nwagon, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Wagon Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
df1['Price'].hist(bins=20)
plt.show()
df_app15['price'].hist(bins=20)
plt.show()
del df_app13,df_app14,df_car,df_app3,df_reg,df_vt,df_model,df_ft,wagon,nwagon
gc.collect()
59778
Model Training¶
import sys
def show_memory_usage():
    vars_list = []
    for name, obj in globals().items():
        if not name.startswith('_'):
            size_mb = sys.getsizeof(obj) / (1024**2)
            if size_mb > 1: # Only show objects > 1MB
                vars_list.append((name, size_mb, type(obj).__name__))
    vars_list.sort(key=lambda x: x[1], reverse=True)
    print("\n🔍 Memory Usage:")
    for name, size, dtype in vars_list[:10]:
        print(f" {name}: {size:.2f} MB ({dtype})")
# Use it throughout your notebook
show_memory_usage()
🔍 Memory Usage:
 df: 221.81 MB (DataFrame)
 df1: 213.99 MB (DataFrame)
 df_app15: 196.68 MB (DataFrame)
 passat: 11.10 MB (Series)
 qashqai: 11.10 MB (Series)
 ft: 11.10 MB (Series)
 vw_model: 4.10 MB (DataFrame)
 fix80: 3.01 MB (Series)
 fix90: 3.01 MB (Series)
 fix00: 3.01 MB (Series)
data = df_app15.copy()
del df_app15,passat, qashqai, ft, vw_model, fix80, fix90, fix00, fix10, cvt, gbus, puv, bmwsuv, astrabus, nosuv, noconvertible, nocoupe, nobus, nowagon, nosedan, nosmall, noaudi
gc.collect()
0
# If the kernel crashes:
# import libraries (Go to the top - press ctrl+F and type libraries to get there faster - run the libraries)
# data = pd.read_pickle('checkpoint_03.pkl') <-- copy this on a new line right below, run it
# This is a checkpoint to start off with the data DF
data.to_pickle('checkpoint_03.pkl')
Train/Validate Split¶
features = data.drop('price', axis=1)
target = data['price']
features_train, features_valid, target_train, target_valid = train_test_split(
features, target,
test_size=0.25,
random_state=12345
)
# Identify categorical columns
cat_cols = features_train.select_dtypes(include=['object','category']).columns
num_cols = features_train.select_dtypes(exclude=['object','category']).columns
features_train = features_train.copy()
features_valid = features_valid.copy()
features_train.loc[:, cat_cols] = features_train[cat_cols].astype(str)
features_valid.loc[:, cat_cols] = features_valid[cat_cols].astype(str)
def evaluate_model(name, model, features_train, target_train, features_valid, target_valid, cat_features=None):
    """Fit the model, time training and prediction, and return the validation RMSE."""
    print(f"\nTraining {name}...")
    start_train = time.time()
    # CatBoost takes the categorical column indices at fit time
    if cat_features is not None:
        model.fit(features_train, target_train, cat_features=cat_features)
    else:
        model.fit(features_train, target_train)
    train_time = time.time() - start_train
    # Prediction time is measured for the whole validation set in one batch
    start_pred = time.time()
    preds = model.predict(features_valid)
    pred_time = time.time() - start_pred
    # Note: squared=False is deprecated in recent scikit-learn releases
    rmse = mean_squared_error(target_valid, preds, squared=False)
    print(f"{name}: RMSE={rmse:.3f}, TrainTime={train_time:.2f}s, PredTime={pred_time:.4f}s")
    return {
        'Model': name,
        'RMSE': rmse,
        'Train_Time': train_time,
        'Predict_Time': pred_time
    }
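The `squared=False` argument used above is deprecated in recent scikit-learn releases. As a version-independent reference, RMSE can also be computed straight from its definition. The sketch below uses only the standard library; the helper name `rmse` is illustrative and not part of the original notebook.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error computed from its definition."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Two cars priced at 5000 and 8000 euros, each mispredicted by 1000 euros
print(rmse([5000, 8000], [6000, 7000]))  # 1000.0
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, which is worth keeping in mind when reading the comparison table.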
gc.collect()
0
ohe_processor = ColumnTransformer(
transformers=[
('cat', OneHotEncoder(handle_unknown='ignore', dtype = int), cat_cols)
],
remainder='passthrough'
)
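`handle_unknown='ignore'` matters because the validation split can contain a category (for example a rare model name) that never occurs in the training split. A rough pure-Python sketch of that behaviour follows; `fit_categories` and `one_hot` are hypothetical helpers written for illustration, not scikit-learn functions.

```python
def fit_categories(values):
    """Learn the sorted list of categories seen during training."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode one value; an unseen category becomes an all-zero row."""
    return [1 if value == cat else 0 for cat in categories]

categories = fit_categories(["sedan", "suv", "wagon", "suv"])
print(one_hot("suv", categories))    # [0, 1, 0]
print(one_hot("coupe", categories))  # [0, 0, 0]  (unknown category is ignored)
```

Without this option the encoder would raise an error at prediction time as soon as it met an unseen value.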
lr_model = Pipeline([
('ohe', ohe_processor),
('lr', LinearRegression())
])
results = []
results.append(
evaluate_model('Linear Regression Model', lr_model, features_train, target_train, features_valid, target_valid)
)
Training Linear Regression Model...
Linear Regression Model: RMSE=2849.125, TrainTime=1.19s, PredTime=0.2909s
gc.collect()
52
# DecisionTree
dt_model = Pipeline([
('ohe', ohe_processor),
('dt', DecisionTreeRegressor(
max_depth=20,
min_samples_leaf=4,
random_state=12345
))
])
results.append(
evaluate_model('Decision Tree Model', dt_model, features_train, target_train, features_valid, target_valid)
)
Training Decision Tree Model...
Decision Tree Model: RMSE=1860.135, TrainTime=28.17s, PredTime=0.2242s
gc.collect()
52
# Random Forest
rf_model = Pipeline([
('ohe', ohe_processor),
('rf', RandomForestRegressor(
n_estimators=100,
max_depth=20,
random_state=12345,
n_jobs=-1
))
])
results.append(
evaluate_model('Random Forest', rf_model, features_train, target_train, features_valid, target_valid)
)
Training Random Forest...
Random Forest: RMSE=1676.330, TrainTime=1138.89s, PredTime=0.7616s
gc.collect()
27
# CATBOOST - Set 1 (Baseline)
cat_features = [features_train.columns.get_loc(c) for c in cat_cols]
cat_model = CatBoostRegressor(
depth=8,
learning_rate=0.1,
iterations=500,
loss_function='RMSE',
verbose=False,
random_seed=12345
)
results.append(
evaluate_model(
'CatBoost Set 1',
cat_model,
features_train,
target_train,
features_valid,
target_valid,
cat_features=cat_features
)
)
# CATBOOST — Set 2 (More complex)
cat_model2 = CatBoostRegressor(
depth=10, # deeper trees
learning_rate=0.03, # slower learning
iterations=800, # more boosting rounds
l2_leaf_reg=5, # L2 regularization
random_strength=1.5, # helps avoid overfitting
loss_function='RMSE',
verbose=False,
random_seed=12345
)
results.append(
evaluate_model(
'CatBoost Set 2',
cat_model2,
features_train,
target_train,
features_valid,
target_valid,
cat_features=cat_features
)
)
Training CatBoost Set 1...
CatBoost Set 1: RMSE=1633.907, TrainTime=243.91s, PredTime=0.7010s
Training CatBoost Set 2...
CatBoost Set 2: RMSE=1641.981, TrainTime=648.14s, PredTime=1.3688s
cat_cols = list(cat_cols)
for col in cat_cols:
    features_train[col] = features_train[col].astype("category")
    features_valid[col] = features_valid[col].astype("category")
gc.collect()
0
# XGBOOST
# Set 2 parameters to compare results; set 1 is a baseline and set 2 is more in depth
# xgb_model is a baseline
xgb_model = Pipeline(steps=[
('preprocess', ohe_processor),
('model', XGBRegressor(
n_estimators=400,
learning_rate=0.05,
max_depth=8,
subsample=0.8,
colsample_bytree=0.8,
random_state=12345,
objective='reg:squarederror',
n_jobs=-1
))
])
results.append(
evaluate_model(
"XGBoost Set 1",
xgb_model,
features_train, target_train, features_valid, target_valid)
)
xgb_model2 = Pipeline(steps=[
('preprocess', ohe_processor),
('model', XGBRegressor(
n_estimators= 600,
learning_rate= 0.03, # slower learning
max_depth= 10, # deeper, more complex
subsample= 0.7, # stronger regularization
colsample_bytree= 0.7,
min_child_weight=5, # added regularization
gamma=0.3, # added regularization
random_state= 12345,
objective= 'reg:squarederror',
n_jobs= -1
))
])
results.append(
evaluate_model(
"XGBoost Set 2",
xgb_model2,
features_train, target_train, features_valid, target_valid)
)
Training XGBoost Set 1...
XGBoost Set 1: RMSE=1643.115, TrainTime=177.35s, PredTime=1.0357s
Training XGBoost Set 2...
XGBoost Set 2: RMSE=1606.470, TrainTime=418.01s, PredTime=2.6917s
gc.collect()
143
# LightGBM datasets
lgb_train = lgb.Dataset(
features_train,
label=target_train
)
lgb_valid = lgb.Dataset(
features_valid,
label=target_valid,
reference=lgb_train
)
# LightGBM Set 1
# Set 1 is a conservative configuration with lower num_leaves and a smaller learning rate.
# It serves as a baseline tuned model.
# Comparing the two sets shows how parameter changes influence the model and which configuration performs best.
params_set1 = {
'objective': 'regression',
'metric': 'rmse',
'num_leaves': 31,
'learning_rate': 0.05,
'verbose': -1
}
print("\nTraining LightGBM (Set 1)...")
start1 = time.time()
lgb_model1 = lgb.train(
params_set1,
lgb_train,
valid_sets=[lgb_valid],
num_boost_round=300,
callbacks=[lgb.early_stopping(stopping_rounds=50)]
)
train_time1 = time.time() - start1
start_pred1 = time.time()
preds1 = lgb_model1.predict(features_valid)
pred_time1 = time.time() - start_pred1
rmse1 = mean_squared_error(target_valid, preds1, squared=False)
results.append({
'Model': 'LightGBM Set 1',
'RMSE': rmse1,
'Boosting_Rounds': lgb_model1.best_iteration,
'Train_Time': train_time1,
'Predict_Time': pred_time1
})
# LightGBM Set 2
# Set 2 increases model complexity (higher num_leaves) and uses a larger learning rate and more boosting rounds.
# This helps evaluate whether deeper, more aggressive boosting improves performance.
params_set2 = {
'objective': 'regression',
'metric': 'rmse',
'num_leaves': 64,
'learning_rate': 0.1,
'verbose': -1
}
print("\nTraining LightGBM (Set 2)...")
start2 = time.time()
lgb_model2 = lgb.train(
params_set2,
lgb_train,
valid_sets=[lgb_valid],
num_boost_round=500,
callbacks=[lgb.early_stopping(stopping_rounds=50)]
)
train_time2 = time.time() - start2
start_pred2 = time.time()
preds2 = lgb_model2.predict(features_valid)
pred_time2 = time.time() - start_pred2
rmse2 = mean_squared_error(target_valid, preds2, squared=False)
results.append({
'Model': 'LightGBM Set 2',
'RMSE': rmse2,
'Boosting_Rounds': lgb_model2.best_iteration,
'Train_Time': train_time2,
'Predict_Time': pred_time2
})
print(f"LightGBM Set 1: RMSE={rmse1:.3f}, TrainTime={train_time1:.2f}, PredTime={pred_time1:.2f}")
print(f"LightGBM Set 2: RMSE={rmse2:.3f}, TrainTime={train_time2:.2f}, PredTime={pred_time2:.2f}")
Training LightGBM (Set 1)...
/.venv/lib/python3.9/site-packages/lightgbm/basic.py:1780: UserWarning: Overriding the parameters from Reference Dataset.
_log_warning('Overriding the parameters from Reference Dataset.')
/.venv/lib/python3.9/site-packages/lightgbm/basic.py:1513: UserWarning: categorical_column in param dict is overridden.
_log_warning(f'{cat_alias} in param dict is overridden.')
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[300] valid_0's rmse: 1670.76
Training LightGBM (Set 2)...
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[496] valid_0's rmse: 1615.9
LightGBM Set 1: RMSE=1670.764, TrainTime=132.12, PredTime=1.51
LightGBM Set 2: RMSE=1615.896, TrainTime=51.72, PredTime=4.73
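Both runs report "Did not meet early stopping": the round budget ran out before 50 consecutive rounds passed without improvement. For Set 2 the best round was 496 of 500, so only 4 non-improving rounds had accumulated. The patience rule can be sketched in plain Python as follows; `early_stopping_round` is a hypothetical helper that mimics the idea and is not LightGBM's implementation.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stopped_early) for a list of validation RMSE values.

    Rounds are 1-based. Training stops once `patience` consecutive rounds
    pass without beating the best score seen so far.
    """
    best_score = float("inf")
    best_round = 0
    for round_no, score in enumerate(scores, start=1):
        if score < best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            return best_round, True
    return best_round, False

# Still improving on the last round: budget exhausted, no early stop
print(early_stopping_round([5.0, 4.0, 3.5, 3.4], patience=2))       # (4, False)
# No improvement for 2 rounds after round 2: stopped early
print(early_stopping_round([5.0, 4.0, 4.1, 4.2, 3.0], patience=2))  # (2, True)
```

Since neither configuration stopped early, a larger `num_boost_round` might lower the validation RMSE further, at the cost of longer training.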
Model analysis¶
# RESULTS TABLE
results_df = pd.DataFrame(results)
results_df.sort_values(by='RMSE', inplace=True)
results_df.reset_index(drop=True, inplace=True)
print("\n\nFINAL MODEL COMPARISON:")
print(results_df.to_string())
FINAL MODEL COMPARISON:
                     Model         RMSE   Train_Time  Predict_Time  Boosting_Rounds
0            XGBoost Set 2  1606.470478   418.009612      2.691655              NaN
1           LightGBM Set 2  1615.896225    51.715416      4.728766            496.0
2           CatBoost Set 1  1633.907363   243.907404      0.700982              NaN
3           CatBoost Set 2  1641.981383   648.135489      1.368841              NaN
4            XGBoost Set 1  1643.114565   177.347155      1.035701              NaN
5           LightGBM Set 1  1670.763510   132.116036      1.507675            300.0
6            Random Forest  1676.329586  1138.894385      0.761605              NaN
7      Decision Tree Model  1860.134819    28.170426      0.224231              NaN
8  Linear Regression Model  2849.125340     1.188438      0.290866              NaN
Final Conclusion¶
This project successfully developed and evaluated multiple machine learning models to predict used car prices for Rusty Bargain's mobile application. The analysis focused on three critical metrics: prediction quality (RMSE), prediction speed, and training time.
Key Findings¶
Best Overall Model: XGBoost Set 2
- Achieved the lowest RMSE of 1,606.47 euros, representing the most accurate predictions
- Required the third-longest training time (approximately 418 seconds) and the second-slowest prediction time (approximately 2.7 seconds for the full validation set)
- Reached this accuracy with a fixed 600 boosting rounds and no early stopping
Model Performance Summary:
- Top performers (RMSE < 1,650): XGBoost Set 2 (1,606.47), LightGBM Set 2 (1,615.90), CatBoost Set 1 (1,633.91), CatBoost Set 2 (1,641.98), and XGBoost Set 1 (1,643.11) all delivered strong predictive accuracy
- LightGBM Set 2 offered the best balance of accuracy and training efficiency with only 9 euros higher RMSE (1,615.90) while training in just 52 seconds with 496 boosting rounds
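- CatBoost Set 1 provided the fastest prediction time among the gradient boosting models (0.7 seconds for the full validation set) with strong accuracy (1,633.91 RMSE), making it well suited to real-time applications; only Linear Regression and the Decision Tree predicted faster, and both were far less accurate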
- Random Forest achieved competitive accuracy (1,676.33 RMSE) but required significantly longer training time (1,139 seconds)
- The Linear Regression baseline reached an RMSE of 2,849.13; the best gradient boosting model (XGBoost Set 2) reduced that error by approximately 44%
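The improvement over the baseline can be checked with a short calculation. The two RMSE values are copied from the results table; `relative_reduction` is an illustrative helper name, not something defined earlier in the notebook.

```python
# RMSE values copied from the results table above (euros)
baseline_rmse = 2849.125340   # Linear Regression
best_rmse = 1606.470478       # XGBoost Set 2

def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline

reduction = relative_reduction(baseline_rmse, best_rmse)
print(f"RMSE reduction vs. baseline: {reduction:.1%}")  # 43.6%
```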
Trade-offs Analysis¶
For Production Deployment:
- If prediction speed is critical: CatBoost Set 1 delivers sub-second predictions (0.7s) with only 27 euros higher RMSE than the best model
- If accuracy is paramount: XGBoost Set 2 provides the best predictions with acceptable training and prediction times
- If training efficiency matters: LightGBM Set 2 offers near-identical accuracy (9 euro difference) while training about 8x faster (52 vs 418 seconds)
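The ratios behind these trade-offs can be recomputed directly from the results table. The sketch below copies three rows from that table into a small dictionary; the names `table` and `compare` are illustrative only.

```python
# (RMSE in euros, train seconds, predict seconds) copied from the results table
table = {
    "XGBoost Set 2": (1606.470478, 418.009612, 2.691655),
    "LightGBM Set 2": (1615.896225, 51.715416, 4.728766),
    "CatBoost Set 1": (1633.907363, 243.907404, 0.700982),
}

def compare(candidate, reference):
    """RMSE gap and speed ratios of `candidate` relative to `reference`."""
    c_rmse, c_train, c_pred = table[candidate]
    r_rmse, r_train, r_pred = table[reference]
    return {
        "rmse_gap": c_rmse - r_rmse,            # euros; positive = less accurate
        "train_speedup": r_train / c_train,     # > 1 means candidate trains faster
        "predict_speedup": r_pred / c_pred,     # > 1 means candidate predicts faster
    }

for name in ("LightGBM Set 2", "CatBoost Set 1"):
    stats = compare(name, "XGBoost Set 2")
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

A ratio below 1 means the candidate is slower than the reference on that axis, which is the case for LightGBM Set 2 at prediction time.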
Technical Approach¶
The project successfully:
- Cleaned and preprocessed 320,000+ records with extensive missing value imputation using hierarchical grouping strategies
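- Implemented categorical encoding suited to each model (native categorical handling for CatBoost via cat_features and for LightGBM via the pandas category dtype; one-hot encoding for Linear Regression, Decision Tree, Random Forest, and XGBoost)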
- Validated that gradient boosting methods significantly outperformed traditional algorithms
- Achieved an RMSE under 1,650 euros for the top five models
Recommendation¶
For Rusty Bargain's mobile application, I recommend deploying XGBoost Set 2 as the primary model. Its RMSE of 1,606.47 euros is the lowest of all models tested. The 2.7-second prediction time covers the entire validation set in one batch, so a single valuation should return well within interactive limits, and the 418-second training time is acceptable for periodic model updates.
Alternative option: LightGBM Set 2 serves as a strong alternative if training efficiency becomes important (e.g., frequent model retraining). With only 9 euros higher error and roughly 8x faster training (52 vs 418 seconds), it offers nearly identical accuracy with significant operational advantages, although its batch prediction time (4.7 seconds) was the slowest measured.
Speed-optimized option: If prediction throughput becomes critical during high-traffic periods, CatBoost Set 1 scores the validation set in 0.7 seconds with an RMSE of 1,633.91, only 27 euros less accurate than the best model while being about 4x faster at prediction.
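The prediction times quoted above are for scoring the whole validation set in one batch, so the per-car cost is far smaller. A rough conversion is sketched below. The row count is an assumption (about 80,000 rows, i.e. 25% of the roughly 320,000 cleaned records mentioned earlier); the exact figure is not shown in this section and should be replaced with `len(features_valid)`.

```python
def per_row_latency_ms(batch_seconds, n_rows):
    """Average prediction time per row in milliseconds."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return batch_seconds / n_rows * 1000

# ASSUMPTION: about 80,000 validation rows; use len(features_valid) in the notebook
assumed_rows = 80_000
for model, seconds in [("XGBoost Set 2", 2.691655),
                       ("LightGBM Set 2", 4.728766),
                       ("CatBoost Set 1", 0.700982)]:
    print(f"{model}: {per_row_latency_ms(seconds, assumed_rows):.4f} ms per row")
```

This batch average understates the latency of a single request, which also pays fixed per-call overhead, so timing one-row predictions separately would give a more realistic figure for the app.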
The gradient boosting approaches demonstrated clear superiority over traditional methods, justifying their computational requirements for this business application where prediction accuracy directly impacts customer trust and satisfaction.
Checklist¶
Type 'x' to check. Then press Shift+Enter.
- Jupyter Notebook is open
- Code is error free
- The cells with the code have been arranged in order of execution
- The data has been downloaded and prepared
- The models have been trained
- The analysis of speed and quality of the models has been performed